Preface



The PaCT-2001 (Parallel Computing Technologies) conference was a four-day 
conference held in Akademgorodok (Novosibirsk), September 3-7, 2001. This 
was the sixth international conference in the PaCT series, organized in Russia 
every odd year. 

The first conference, PaCT-91, was held in Novosibirsk (Akademgorodok),
September 7-11, 1991. The next PaCT conferences were held in Obninsk (near
Moscow), August 30 - September 4, 1993; in St. Petersburg, September 12-15,
1995; in Yaroslavl, September 9-12, 1997; and in Pushkin (near St. Petersburg),
September 6-10, 1999. The PaCT proceedings are published by Springer-Verlag
in the LNCS series.

PaCT-2001 was jointly organized by the Institute of Computational Mathe- 
matics and Mathematical Geophysics of the Russian Academy of Sciences (Novo- 
sibirsk), the State University, and the State Technical University of Novosibirsk. 

The purpose of the conference was to bring together scientists working with 
theory, architecture, software, hardware, and solution of large-scale problems 
in order to provide integrated discussions on parallel computing technologies. 
The conference attracted about 100 participants from around the world. Au- 
thors from 17 countries submitted 81 papers. Of those submitted, 36 papers 
were selected for the conference as regular ones; there were also 4 invited pa- 
pers. In addition there were a number of posters presented. All the papers were 
internationally reviewed by at least three referees. As usual a demo session was 
organized for the participants. 

Many thanks to our sponsors: the Russian Academy of Sciences, the Rus- 
sian Fund for Basic Research, the Russian State Committee of Higher Educa- 
tion, the European Commission (Future and Emerging Technologies, Directorate 
General-Information Society) for their financial support. Organizers highly ap- 
preciated the help of the Association Antenne-Provence (France). 
to Reaction-Diffusion Processes Simulation



Olga Bandman 



Supercomputer Software Department 
ICMMG, Siberian Branch 
Pr. Lavrentieva, 6, Novosibirsk, 630090, Russia
bandman@ssd.sscc.ru



Abstract. A hybrid approach for simulating reaction-diffusion pro- 
cesses is proposed. It combines into a single iterative procedure Boolean 
operations of Cellular Automata Diffusion with real number computa- 
tion of nonlinear reaction function. The kernel of the proposed approach 
is in constructing methods for transforming reals into spatial distribu- 
tion of Boolean values. Two algorithms are proposed and illustrated by 
the simulation of some well studied typical reaction-diffusion phenom- 
ena. Computational features of the methods are discussed and problems 
for future research are outlined. 

1 Introduction 

There are a number of well-known Cellular Automata diffusion and Gas-Lattice
models [1,2,3], as well as some attempts to find cellular automata simulating
kinetic and chemical processes. Following [4], all these models should be
considered as “alternatives rather than approximations of Partial Differential
Equations (PDE) solutions”. These discrete models have a number of
computational advantages, the most important being the absolute stability of
computation and the absence of rounding errors. These properties attract
mathematicians, while specialists in chemistry, biology and physics are
interested in creating models of phenomena which have no mathematical
description at all. Such Cellular Automata (CA) are constructed on the basis of
kinetic or chemical microscopic dynamics. Boolean cell states simulate the
existence or the absence of an abstract particle (molecule, velocity component,
concentration, etc.) at certain points of time and space. Cell operations are
represented as Boolean functions of states in the cell neighborhood. To obtain
a physical interpretation of Boolean results, a sum of state values over an
area around each cell is calculated. Two prominent examples are a deterministic
chemical CA, proposed in [5], and a “Stochastic Cellular Automaton” from [6],
which are intended for simulating chemical processes in active media. In [7] a
reaction-diffusion CA is presented, based on a neuron-like model, whose
elementary automaton executes a threshold function and has a refractory period
after the active state. In [8,9] many interesting industrial applications of
Cellular-Automata models are presented.
An important problem not yet completely solved in the above approaches
is to prove the correspondence of the cellular array evolution to the modeled
phenomenon, as well as to find a way of accounting for physical parameters
(density, viscosity, diffusion coefficient, pressure, etc.) in the array
function parameters. The most correct approach to solving these problems might
be a natural experiment, but such experiments are sometimes impractical.
Certain particular results have been obtained theoretically for the
CA-diffusion with Margolus neighborhood [2] and for the Gas-Lattice FHP-model
[10]. In both cases the proofs of the correspondence of the CA evolution to the
modeled phenomenon are done by reducing the CA to the PDE of the modeled
phenomenon.

There are also many problems in studying reaction-diffusion processes by
PDE analysis. They are investigated literally piece by piece (equations of
Gordon, Fitz-Nagumo, Belousov-Zhabotinsky, etc.), and with much difficulty,
because analytical solutions are impossible due to the nonlinearity, and
numerical methods are limited by stability and accuracy problems [12,13].

Unfortunately, up to now no method is known for determining a CA-model of a
process when its PDE description is known. The latter is a system of PDEs of
first order in time, having on their right-hand sides two additive terms: 1) a
Laplacian to represent the diffusion, and 2) a nonlinear function to represent
the reaction (in chemistry), the advective process (in hydrodynamics), phase
conversion (in crystallization), or population evolution (in ecology). The
first is perfectly modeled by CA, and the second is easy to compute without the
danger of making the computation unstable.

From the above it follows that it makes sense to find methods which combine
CA-diffusion with calculation of the reaction function in reals. We propose to
state the problem as follows: given a reaction-diffusion PDE, a discrete
cellular algorithm is to be constructed whose evolution approximates that of
the finite-difference PDE. Obviously, it should be an iterative algorithm, at
each step performing the operation of transforming spatially distributed
Boolean values into averaged reals, and the inverse operation, referred to as
the allocation procedure. The latter is precisely the most crucial point of the
algorithm. Thus, we propose to exploit well-studied CA-models of diffusion [3],
combining them with the integer approximation of the reaction function.

The motivation for such an approach rests on two arguments. The first is
the wish to use the great experience accumulated in studying nonlinear
phenomena by PDE solving. The second is to obtain rather simple discrete models
to replace PDEs, the solution of which is sometimes impractical. We are not
aware of attempts to use such an approach, so we shall try to fill the gap.

To give a mathematical background for the proposed methods, the formalism
of the Parallel Substitution Algorithm (PSA) [14] is used, which allows
real-number and Boolean computation to be combined in a single iterative
process.

Apart from the Introduction and Conclusion, the paper contains four sections.
In the second section the main concepts and formalisms used in the paper are
presented. The general scheme and two algorithms for transforming a PDE into a
discrete cellular automaton are presented in the third section. In the fourth
section the computer simulation results are given. In the short fifth section
the properties of the proposed methods are discussed and problems for future
investigation are outlined.



2 Continuous and Discrete Forms of Spatial Dynamics Representation



2.1 Reaction-Diffusion Partial-Differential Equations 



Let us consider a reaction-diffusion process as the concentration of a
certain substance given as a function of time and space. The traditional
representation of the simplest one-dimensional reaction-diffusion process has
the form of the following PDE:

$$\frac{\partial u}{\partial t} = d\,\frac{\partial^2 u}{\partial x^2} + F(u), \tag{1}$$

where $u$ is a variable with the normalized domain from 0 to 1, $t, x$ are
continuous time and space, $d$ is a diffusion coefficient, and $F(u)$ is a
differentiable nonlinear function satisfying certain conditions, which in [11]
are given as follows.



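$$\begin{aligned} &F(0) = F(1) = 0; \quad F(u) > 0 \ \text{if}\ 0 < u < 1; \\ &F'(0) = a, \quad a > 0; \quad F'(u) < a \ \text{if}\ 0 < u < 1. \end{aligned} \tag{2}$$

The conditions (2) are met by a second-order polynomial (Fig. 1a) of the form

$$F(u) = a u (1 - u). \tag{3}$$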
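As a quick check, the quadratic reaction function can be verified against the stated conditions directly. The short derivation below is a sketch that assumes only $a > 0$ and the form $F(u) = a u (1 - u)$:

$$\begin{aligned} F(0) &= a \cdot 0 \cdot (1 - 0) = 0, \qquad F(1) = a \cdot 1 \cdot (1 - 1) = 0, \\ F(u) &= a u (1 - u) > 0 \quad \text{for } 0 < u < 1, \text{ since all three factors are positive}, \\ F'(u) &= a (1 - 2u), \qquad F'(0) = a > 0, \\ F'(u) &= a - 2 a u < a \quad \text{for } 0 < u < 1. \end{aligned}$$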





Fig. 1. The nonlinear functions used in typical reaction-diffusion equation 



Equation (3) also describes the propagating front of the autocatalytic reaction
(Field-Noyes model [15]). Equation (1) with $F(u)$ satisfying (2) is studied in
detail in [11,14]. It is known that with the initial conditions



u(xO) = l^ 



( 4 ) 
it generates an autowave of the propagating-front type, which moves (at
$t \to \infty$) with the velocity

$$V = 2\sqrt{da}. \tag{5}$$

In ecological research functions satisfying (2) are classified as logistic ones
and considered to be basic, although some others are also studied, for example,
those represented by third-order polynomials (Fig. 1b), such as

$$F(u) = a u (1 - u)(u - k), \quad 0 < k < 1, \tag{6}$$

which meet the following conditions:

$$\begin{aligned} &F(0) = F(k) = F(1) = 0, \quad 0 < k < 1; \\ &F(u) < 0 \ \text{if}\ 0 < u < k; \\ &F(u) > 0 \ \text{if}\ k < u < 1; \\ &F'(0) < 0, \quad F'(k) > 0, \quad F'(1) < 0. \end{aligned} \tag{7}$$

With $F(u)$ of the form (6) the propagating front velocity is

$$V = \sqrt{\frac{ad}{2}}\,(1 - 2k). \tag{8}$$



Moreover, when the initial conditions have the form

$$u(x, 0) = \begin{cases} u_0 & \text{if } |x| < l, \quad k < u_0 < 1, \\ 0 & \text{if } |x| > l, \end{cases} \tag{9}$$

referred to as a “flash”, then the wave may attenuate if $F(u)_{\max}$ is not
sufficiently large.

The above analytical characteristics of some simple and well-studied
reaction-diffusion phenomena are further compared with the similar ones
obtained by simulation of CAs. Obviously, their correspondence would confirm
the correctness of the proposed method.



2.2 Parallel Substitution Algorithm for Discrete Cellular Simulation

The Parallel Substitution Algorithm (PSA) [14] is a convenient formalism for
representing spatially distributed processes. It suits our purpose well because
it allows dealing with both Boolean and real data. The following properties of
PSA make it powerful for this purpose.

• PSA processes cellular arrays, which are sets of cells given as pairs
$C(A, M) = \{(a, m)\}$, where $a \in A$ is a cell state and $m \in M$ is a cell
name. $A$ is an alphabet (in our case it is Boolean or real). $M$ is a naming
set (in the general case a countable one). On the set $M$ naming functions
$\varphi : M \to M$ are defined. The naming set is the set of discrete
Cartesian coordinates, given as $m = (i, j, k)$. In our case only shift naming
functions are used. A set of naming functions determines the names of any cell
neighborhood.

• Operations over a cellular array are specified by a set
$\Phi = \{\Theta_i\}$, $i = 1, \ldots, n$, of parallel substitutions of the form

$$\Theta_i : C_i(m) * S_i(m) \to S'_i(m), \tag{10}$$

where

$$\begin{aligned} C_i(m) &= \{(y_{ik}, \varphi_{ik}(m)) : k = 0, \ldots, q_y\}, \\ S_i(m) &= \{(x_{ij}, \varphi_{ij}(m)) : j = 0, \ldots, q_x\}, \\ S'_i(m) &= \{(f_{ij}(X, Y), \varphi_{ij}(m)) : j = 0, \ldots, q_x\}. \end{aligned} \tag{11}$$

In (10), (11) $C_i(m)$, $S_i(m)$ and $S'_i(m)$ are local configurations, $*$
meaning their union for any $m \in M$. Further only stationary parallel
substitutions are used, in which the neighborhoods of $S_i(m)$ and $S'_i(m)$
are formed by identical sets of naming functions, which contain the identity
naming function $\varphi(m) = m$, referring to the central cell of the
substitution. A parallel substitution should meet the following conditions:

1) no pair of naming function values in (11) are equal,

2) $x_{ij} \in X$, $y_{ik} \in Y$ are state variables or constants and
$f_{ij}(X, Y)$ are cellular functions with the domain from $A$.

• A substitution is applicable to $C(A, M)$ if there is at least one cell
named $m \in M$ such that $C_i(m) \cup S_i(m) \subseteq C(A, M)$. Application of
a substitution at a cell $(a, m) \in C(A, M)$ replaces the cell states in
$S_i(m)$, called the base, by the corresponding ones from $S'_i(m)$, the set of
cells $C_i(m)$ (called the context) remaining unchanged.

• There are three modes of application of parallel substitutions.

1) Synchronous mode, when at each step all substitutions are applied at all
cells at once. In this case, in order to provide determinism of the
computation, one should be careful not to allow the substitutions to be
contradictory when $|S'_i(m)| > 1$ [14].

2) Asynchronous mode, when any substitution is applied at any cell, one
application being allowed at a time. There is no danger of contradiction in
this case, but a random number generator should be used to determine the next
cell to which the substitutions are to be applied each time.

3) 2-step synchronous mode, when the cellular array under processing is
partitioned into two parts, and at each time step the substitutions act on only
one of them.

• A Parallel Substitution Algorithm (PSA) is a set of substitutions to- 
gether with indication of the mode of application. Implementation of a PSA over 
a cellular array C is an iterative procedure, where at each step the substitution 
set is executed at a set of cells, according to the given mode. The algorithm 
stops when no substitution is applicable to the array. 

• A PSA may process not only one but a number of interacting arrays
$C = \{C_1, \ldots, C_n\}$ as well. In the latter case each substitution is
allowed to be applied to only one array. It means that its base $S_i(m)$ is
located in only one $C_l \in C$, i.e. $m \in M_l$. As for the context $C_i(m)$,
it may be located in any array; moreover, it may be composed of a number of
local configurations located in different arrays, i.e.

$$C_i(m) = C_{i1}(m_1) * \ldots * C_{ik}(m_k), \quad k \le n; \ m_j \in M_j. \tag{12}$$

PSA is further used to represent reaction-diffusion processes by discrete fine- 
grained parallel algorithms. 
3 Combining CA-Diffusion with Finite-Difference Reaction

3.1 General Scheme of the Computational Process 

Without loss of generality let us consider the two-dimensional case. After time
and space are transformed to the discrete form, resulting in $x = hi$,
$y = hj$, $t = n\tau$, where $i, j, n$ are integers and $h = 1$, equation (1)
looks as follows:

$$u_{ij}(t + 1) = u_{ij}(t) + \tau d\, L(u_{ij}(t)) + \tau F(u_{ij}(t)), \tag{13}$$

where $L(u_{ij})$ is a Laplacian and $u_{ij} = u_{ij}(t)$ are variable values in
real numbers.
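The explicit update of the finite-difference scheme can be sketched in a few lines. This is only an illustration under stated assumptions: the Laplacian is taken as the usual five-point stencil with $h = 1$, the boundaries are periodic, and the function names (`laplacian`, `fd_step`, `logistic`) are hypothetical and not taken from the paper.

```python
def laplacian(u, i, j):
    """Five-point Laplacian at cell (i, j) with h = 1.

    Periodic boundaries are an assumption made for this sketch.
    """
    rows, cols = len(u), len(u[0])
    return (u[(i - 1) % rows][j] + u[(i + 1) % rows][j]
            + u[i][(j - 1) % cols] + u[i][(j + 1) % cols]
            - 4.0 * u[i][j])


def fd_step(u, d, tau, reaction):
    """One explicit step: u(t+1) = u(t) + tau*d*L(u) + tau*F(u)."""
    rows, cols = len(u), len(u[0])
    return [[u[i][j] + tau * d * laplacian(u, i, j) + tau * reaction(u[i][j])
             for j in range(cols)]
            for i in range(rows)]


def logistic(a):
    """Reaction function F(u) = a*u*(1 - u)."""
    return lambda u: a * u * (1.0 - u)
```

With a spatially uniform field the Laplacian vanishes, so only the reaction term acts; with a zero reaction function the total amount of substance is conserved by the step.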

Let us now focus on the most discrete form of process representation, when
concentration values are given in Boolean form, i.e. $u \in \{0, 1\}$. Using
the PSA notation we consider coordinates $i, j$ as cell names $(i, j) \in M$,
and concentration values as cell states $a \in A$, the process to be simulated
being given by a set of parallel substitutions acting on the cellular array
$C \in A \times M$. The correspondence between continuous and discrete forms of
representation is that $u_{ij}(t + 1)$ and $u_{ij}(t)$ are averaged values over
a certain area $Av(ij)$ around a cell named $(i, j)$, referred to further as
the averaging area,

$$u_{ij} = \sum_{(k,l) \in Av(ij)} u_{kl}. \tag{14}$$

When $F(u) = 0$, (1) and (13) describe “pure diffusion”, which has some simple
and well-studied CA-models. The best known of them, called the Block-Rotation
CA-model, is theoretically proved [2] to be equivalent to the Laplace equation
with $d = 3/2$ (in the 2D case). Moreover, in [3] it is shown how to use the
model with any diffusion coefficient.

The above justifies decomposing each step of the iterative simulation
procedure into three operations: application of a CA-diffusion rule,
computation of the reaction function, and combining the results.

Accordingly, the array $C$ under simulation is partitioned into three parts:
the diffusion array $C_D$ with Boolean cell states, the reaction array $C_R$,
and the resulting array $C'$, both with real cell states, the naming sets of
the parts being in one-to-one correspondence.

CA-diffusion rules are applied to the diffusion array, resulting in
$C'_D(t + 1)$. As for the reaction function computation, it may be accomplished
only in reals, resulting in $C_R(t + 1)$. Finally, to obtain the next state
both results should be combined in such a way that the result of the $t$-th
iteration satisfies the following conditions:

$$C_D(t + 1) = \big(C'_D(t + 1)\big) \oplus C_R(t + 1), \qquad C'(t + 1) = Av\big(C_D(t + 1)\big), \tag{15}$$

where “$\oplus$” means cell-wise summing of states, and $Av(C_D(t + 1))$ has
states obtained by (14) applied to the corresponding states of $C_D(t + 1)$.
Both cellular arrays $C_D(t + 1)$ and $C'(t + 1)$, representing the result of
the $t$-th iteration in Boolean and real form respectively, are the initial
arrays for the next iteration (Fig. 2).
Fig. 2. General scheme of one iteration of the hybrid method for simulating
the reaction-diffusion process given by an equation of the form (13)



The main problem to be solved in constructing computational algorithms
according to the general scheme given in Fig. 2 is to find the procedure which
is inverse to averaging. It consists in distributing “ones” over the cellular
array in a way that guarantees the given averaged values, and is referred to
further as the allocation procedure. Allocation is precisely the problem that
constitutes the kernel of the proposed approach. Two allocation procedures
determine two algorithms for combining Boolean and real computations into a
single iterative procedure. The first is called the multilayer method. It
requires one dimension to be added to the diffusion cellular space, so the
array is treated as a multilayer one. An iteration of the CA-diffusion
algorithm is executed in all diffusion layers independently, and the averaging
and the allocation are performed over the subarray containing corresponding
cell names, i.e. names differing only by the layer number. The second method is
referred to as the three-layer method. In it the CA-diffusion is performed in
only one layer of the array, and the averaging and allocation are performed
over the averaging area, which contains cells whose spatial coordinates differ
by no more than a constant $r$, referred to as the radius of averaging. The
allocation is done by inverting the cell state with a probability depending on
the number of bits to be allocated and the neighborhood size.

In the following subsections the above methods are given formally in terms
of PSA. The CA-diffusion algorithms used in the methods are not described
there; they are given in brief in the examples of Section 4.

3.2 A Multilayer Reaction-Diffusion Method 

To simulate a reaction-diffusion process in an n-dimensional space an
(n+1)-dimensional cellular array should be taken, which is further considered
as a multilayer n-dimensional one. So, the naming set of the array in the 2D
case is $M = \{(i, j, k)\}$, where $i, j = \ldots, 0, 1, 2, \ldots$ are
coordinates of the 2D infinite space, further referred to as spatial
coordinates, and $k \in \{0, \ldots, L\}$ is a layer number. The naming set is
partitioned into two subsets, $M = M_D \cup M_R$: $M_R = \{(i, j, 0)\}$ contains
the names of the zero layer cells, $M_D$ the cell names of all other layers.
$M_D$ in its turn is partitioned into subsets of names differing only by the
layer numbers: $M_D = \bigcup_{ij} M_{ij}$, where
$M_{ij} = \{(i, j, k) : k = 1, \ldots, L\}$ is the averaging area forming the
averaging subarray $C_{ij}$ of the cell with spatial coordinates $(i, j)$.

Cells with names from $M_D$ and $M_R$ have the state alphabets $A_D = \{0, 1\}$
and $A_R = \mathbb{R}$, respectively.

The subarray $C_R = \{(v, (i, j, 0)) : v \in A_R, (i, j, 0) \in M_R\}$ plays a
twofold role: it is destined for computing the nonlinear function and for
storing the averaged result. In each $k$-th layer of the diffusion part
$C_D = \{(u, (i, j, k)) : u \in A_D, (i, j, k) \in M_D\}$ one of the chosen 2D
CA-diffusion algorithms is realized.

The scheme of the multilayer algorithm is as follows. 

An initial cellular array $C(0) = C_D(0) \cup C_R(0)$ of finite size
$G \times H \times L$ is given, where
$M = \{(i, j, k) : i = 0, 1, \ldots, G - 1;\ j = 0, 1, \ldots, H - 1;\ k = 0, \ldots, L\}$.
The cells of the diffusion layers ($k = 1, \ldots, L$) have Boolean states,
zeros and ones being distributed over the layers in such a way that the number
of ones in the averaging subarray of the cell is equal to the initial
concentration of the substance in the space under simulation. These
concentration values $v_{ij}$, given in real numbers, are the cell states of
the reaction layer ($k = 0$).

The computation is an iterative procedure, each t-th iteration being com- 
posed of the following steps. 

• Step 1. In all diffusion layers an iteration of the CA-diffusion transition
rule is performed. It results in changing the cell states of $C_D(t)$, i.e.

$$C_D(t + 1) = \mathrm{Dif}(C_D(t)). \tag{16}$$

It should be noted that at this step there are no interactions between the
layers.

• Step 2. In each cell $(v, (i, j, 0)) \in C_R$ the nonlinear function $F(v)$
is computed and the nearest integer to the result becomes the cell state:

$$\Theta_1 : \{(v, (i, j, 0))\} \to \{(y, (i, j, 0))\}, \quad \text{where } y = \mathrm{Int}(F(v)). \tag{17}$$

• Step 3. The allocation operation is performed as follows. In each subset
$C_{ij} \subset C_D$ a number of cells equal to the absolute value of the state
of the cell $(y, (i, j, 0))$ is inverted according to its sign. If $y > 0$,
then cells with states $u = 0$ are inverted (set to 1); else, if $y < 0$, the
same is done with cells in state $u = 1$.

$$\begin{aligned} \Theta_2 &: \{(|y| \ge 1, (i, j, 0))\} * \{(u, (i, j, k))\} \to \{(\bar{u}, (i, j, k))\}, \\ \Theta_3 &: \{(|y| \ge 1, (i, j, 0))\} \to \{(|y| - 1, (i, j, 0))\}, \end{aligned} \tag{18}$$

$k$ ranging from 1 to $L$.

The allocation operation results in the subarray $C_D(t + 1)$.

• Step 4. The averaging operation over all $M_{ij}$ is performed according to
(14):

$$\Theta_4 : \{(u_1, (i, j, 1)), \ldots, (u_L, (i, j, L))\} * \{(v, (i, j, 0))\} \to \{(v', (i, j, 0))\}, \tag{19}$$

$$v'_{ij} = \sum_{k=1}^{L} u_{ijk}.$$

The averaging operation results in the subarray $C_R(t + 1)$.

• Step 5. If the whole computation is not completed, i.e. $t + 1 < T$, then
the subarrays obtained in steps 3 and 4 are taken as the initial ones for the
next iteration; otherwise the computation is considered to be completed and
$C(t + 1) = C_D(t + 1) \cup C_R(t + 1)$ is its result.

In sections 4.1 and 4.2 this algorithm is illustrated by simulation results for
two types of autowaves: 1D and 2D propagating fronts.
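The five steps of the multilayer method can be sketched for a 1D space as follows. This is a hedged illustration rather than the author's implementation. It assumes that the reaction layer stores the count $v$ of ones in each column of $L$ diffusion cells, that the reaction is the logistic function evaluated at the normalized concentration $v/L$, and that the integer increment is $y = \mathrm{Int}(\tau L F(v/L))$. This scaling and the names `reaction_increment`, `allocate`, `multilayer_step` and `diffuse` are hypothetical.

```python
import random


def reaction_increment(v, L, a, tau):
    """Step 2: integer number of cells to invert in one column.

    Assumption of this sketch: y = Int(tau * L * F(v / L)) with the
    logistic F(u) = a*u*(1 - u); v is the count of ones among L cells.
    """
    u = v / L
    return round(tau * L * a * u * (1.0 - u))


def allocate(column, y, rng):
    """Step 3: invert |y| randomly chosen cells of one averaging subarray.

    y > 0 turns zeros into ones, y < 0 turns ones into zeros. The request
    is clipped to the number of cells available for inversion.
    """
    target = 0 if y > 0 else 1
    candidates = [k for k, u in enumerate(column) if u == target]
    for k in rng.sample(candidates, min(abs(y), len(candidates))):
        column[k] = 1 - target


def multilayer_step(layers, v, diffuse, a, tau, rng):
    """One iteration over `layers`, a list of L Boolean lists of length G.

    `v` holds the reaction-layer states from the previous iteration and
    `diffuse(layer, rng)` is any in-place CA-diffusion rule for one layer.
    Returns the new reaction-layer states (Step 4).
    """
    L, G = len(layers), len(layers[0])
    for layer in layers:                  # Step 1: layers do not interact
        diffuse(layer, rng)
    new_v = []
    for j in range(G):
        column = [layers[k][j] for k in range(L)]
        y = reaction_increment(v[j], L, a, tau)    # Step 2
        allocate(column, y, rng)                   # Step 3
        for k in range(L):
            layers[k][j] = column[k]
        new_v.append(sum(column))                  # Step 4
    return new_v
```

Step 5 then amounts to calling `multilayer_step` in a loop for $T$ iterations, feeding the returned list back in as `v`.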



3.3 Three-Layer Reaction-Diffusion Method

The three-layer hybrid method provides for a three-layer array, whose spatial
coordinates together with the layer number ($k = 0, 1, 2$) form the naming set.
For definiteness the 2D case is further considered. Let the layer with $k = 0$
be the reaction subarray $C_R = \{(y, (i, j, 0))\}$, the layer with $k = 1$ the
diffusion layer $C_D = \{(u, (i, j, 1))\}$, and the last one with $k = 2$ the
layer storing the averaged diffusion results, $C' = \{(v, (i, j, 2))\}$.
Respectively, $u \in A_D$; $y, v \in A_R$. Averaging is performed over the
neighborhood $Q = \{(i + h, j + l, 1) : h, l = -r, \ldots, r\}$. Initially
$C_D(0)$ and $C'(0)$ contain the Boolean distribution and averaged values of
concentration at $t = 0$, and $C_R(0)$ has zero states in all cells.

The simulation procedure is an iterative one with the t-th iteration consisting 
of the following steps. 

• Step 1. In the diffusion layer an iteration of a CA-diffusion algorithm
is executed, resulting in

$$C_D(t + 1) = \{(u, (i, j, 1)) : i = 0, \ldots, M - 1;\ j = 0, \ldots, N - 1\}.$$

• Step 2. In the cells of the reaction subarray $C_R$ the nearest integer to
the value of the function $F(v)$ is computed according to the substitution

$$\Theta_5 : \{(v, (i, j, 2))\} * \{(0, (i, j, 0))\} \to \{(y, (i, j, 0))\}, \quad \text{where } y = \mathrm{Int}(F(v)). \tag{21}$$

• Step 3. The allocation operation is performed by inverting cell states in
the diffusion layer according to the following probabilities:

$$p = \frac{y}{|Q| - v} \ \text{ if } u = 0 \ \&\ y > 0; \qquad p' = \frac{|y|}{v} \ \text{ if } u = 1 \ \&\ y < 0. \tag{22}$$

Two parallel substitutions executing this operation are as follows:

$$\begin{aligned} \Theta_6 &: \{(v, (i, j, 2)), (y, (i, j, 0))\} * \{(u, (i, j, 1))\} \to \{(u', (i, j, 1))\}, \\ \Theta_7 &: \{(v, (i, j, 2)), (y, (i, j, 0))\} \to \{(0, (i, j, 2)), (0, (i, j, 0))\}, \end{aligned} \tag{23}$$

where

$$u' = \begin{cases} 1, & \text{if } (u = 0) \ \&\ (y > 0) \ \&\ (\mathrm{rand}(1) < p); \\ 0, & \text{if } (u = 1) \ \&\ (y < 0) \ \&\ (\mathrm{rand}(1) < p'). \end{cases} \tag{24}$$

• Step 4. The averaging operation is performed in all cells of the averaging
subarray $C'$ according to the substitution

$$\Theta_8 : \{(u_{i-r,j-r}, (i - r, j - r, 1)), \ldots, (u_{i+r,j+r}, (i + r, j + r, 1))\} * \{(0, (i, j, 2))\} \to \{(v, (i, j, 2))\}, \tag{25}$$

where

$$v_{ij} = \sum_{h=-r}^{r} \sum_{l=-r}^{r} u_{i+h,j+l}. \tag{26}$$

• Step 5. If the whole computation is not completed, i.e. $t + 1 < T$, then
the subarrays obtained in steps 3 and 4 are taken as the initial ones for the
next iteration; otherwise the computation is considered to be completed and
$C(t + 1) = C_D(t + 1) \cup C'(t + 1)$ is its result.

The above method of allocation is approximate. It is absolutely accurate only 
in case of uniform probability distribution over the cell neighborhood. In case 
of pij variation, the expectation Ai of the event Y meaning that the number of 
inversed cell-states in the neighborhood is pij is equal to 

- Vij- 
Q 

Moreover, the approximation error approaches zero when the deviations are
of different signs. In any case some corrective coefficients may be provided to
reduce the error to any small value.
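The probabilistic allocation of the three-layer method can be illustrated with a small 1D sketch. It rests on assumptions that should be read as such: a zero cell is set to one with probability $y / (|Q| - v)$ and a one cell is cleared with probability $|y| / v$, where $v$ is the number of ones in the neighborhood of radius $r$ and $|Q| = 2r + 1$; the boundaries are periodic; and the function names are hypothetical.

```python
import random


def neighborhood_count(layer, j, r):
    """Number of ones among cells j-r..j+r (periodic wrap is an assumption)."""
    G = len(layer)
    return sum(layer[(j + h) % G] for h in range(-r, r + 1))


def probabilistic_allocation(layer, y, r, rng):
    """Return a new Boolean layer after stochastic allocation.

    layer : list of 0/1 states of the diffusion layer
    y     : list of integer reaction increments, one per cell
    r     : radius of averaging, so |Q| = 2r + 1 in this 1D sketch
    """
    q = 2 * r + 1
    result = list(layer)
    for j, u in enumerate(layer):
        v = neighborhood_count(layer, j, r)
        if u == 0 and y[j] > 0:
            p = y[j] / (q - v)       # q - v >= 1 because cell j itself is 0
            if rng.random() < p:
                result[j] = 1
        elif u == 1 and y[j] < 0:
            p_prime = -y[j] / v      # v >= 1 because cell j itself is 1
            if rng.random() < p_prime:
                result[j] = 0
    return result
```

Under these assumptions the expected number of inversions in a neighborhood with uniform $y$ and $v$ equals $|y|$, which is the sense in which the allocation is exact only for a uniform distribution.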

4 Computer Simulation Results 

4.1 Simulating 1D Propagating Front by Multilayer Method

As mentioned above, the use of hybrid simulation methods requires choosing an
appropriate CA-diffusion model to be included in the algorithms.

Comparative analysis of CA-diffusion models presented in [3] allows the
following conclusion: the model called naive CA-diffusion should be used for
the 1D case, and the Block-Rotation method (BR-method) is the best for the 2D
one. The diffusion coefficient of naive CA-diffusion is not known, so it has
been obtained by simulation, by comparing the results with those obtained by
solving the PDE, the result being $d = 1.1$. So, the hybrid method for the
one-dimensional propagating front combines 1D naive CA-diffusion [3] with the
nonlinear function of the form (3).

Naive CA-diffusion is the simplest model of equalizing the concentration
by local stirring along one direction. Let it be the direction along the axis
$j$ of the 1D $(L + 1)$-layer cellular array
$C = \{(u_{jk}, (j, k)) : j = 0, 1, \ldots, G;\ k = 0, \ldots, L\}$,
$C = C_D \cup C_R$. The diffusion subarray $C_D$, which contains the layers
$k = 1, \ldots, L$, uses the alphabet $A = \{0, 1\}$, the variables being
denoted by $u_{jk}$. Naive CA-diffusion dictates that each cell exchange states
with one of its randomly chosen neighbors. To avoid contradictions the
asynchronous mode of execution is used. It means that at each time only one
(randomly chosen) pair of cells exchanges states. So, one iteration, which
corresponds to Step 1 of the general scheme, comprises $G \times L$ times, at
each of which the following operations are executed.

1) Two random numbers $j, k$, $0 \le j < G$, $1 \le k \le L$, are obtained.
They indicate the cell to which the algorithm is applied.

2) A context cell $(\alpha, m_0)$ is introduced to indicate the neighbor with
which the cell should interact. The neighbor is determined according to the
probability $p = 1/2$. So, if a random number $r < 1/2$ ($0 < r < 1$), then
$\alpha = 1$, which means that the neighbor to interact with is at the right
side of the cell $(j, k)$. If $r \ge 1/2$, then $\alpha = 0$ and the left
neighbor is chosen.

3) The following parallel substitutions are applied to the chosen cell of
$C_D$:

$$\begin{aligned} \Theta_9 &: \{(1, m_0)\} * \{(u, (j, k)), (u', (j + 1, k))\} \to \{(u', (j, k)), (u, (j + 1, k))\}; \\ \Theta_{10} &: \{(0, m_0)\} * \{(u, (j, k)), (u', (j - 1, k))\} \to \{(u', (j, k)), (u, (j - 1, k))\}. \end{aligned} \tag{27}$$

The other steps are executed in complete accordance with the general scheme 
(section 3.2). The difference is only in the absence of coordinate i in the names. 
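The asynchronous naive diffusion of a single Boolean layer can be sketched as below. The sketch makes two assumptions not stated in the text: one iteration of a layer of length $G$ consists of $G$ elementary exchanges, and a cell at the end of the array skips an exchange with a missing neighbor. The function name is hypothetical.

```python
import random


def naive_diffusion_iteration(layer, rng):
    """One iteration of asynchronous naive CA-diffusion on a 1D Boolean layer.

    Each elementary step picks a random cell and swaps its state with the
    right or left neighbor with probability 1/2 each. End cells skip an
    exchange with a missing neighbor (a boundary rule assumed here).
    """
    G = len(layer)
    for _ in range(G):
        j = rng.randrange(G)                        # randomly chosen cell
        neighbor = j + 1 if rng.random() < 0.5 else j - 1
        if 0 <= neighbor < G:
            layer[j], layer[neighbor] = layer[neighbor], layer[j]
    return layer
```

Since every step is a swap, the number of ones in the layer is conserved, which is the discrete counterpart of mass conservation in pure diffusion.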




Fig. 3. A pair of snapshots of the profiles of the 1D propagating front,
obtained by the multilayer method with naive asynchronous CA-diffusion and the
nonlinear function of the form (3)



In Fig. 3a two profiles of the 1D propagating front are given. They have been
obtained by simulation using the multilayer algorithm with $N = 128$,
$L = 32$, and $F(u)$ of the form (3) with $a = 1.2$. The propagation velocity
has been determined by analyzing the front profiles obtained from a series of
similar experiments. The coincidence with the velocity obtained by formula (5)
is within the limits of the accuracy of the experiment. For example, in
accordance with the well-known character of propagating front behavior [11,14],
its propagation velocity decreases with time, approaching (according to (5)) at
$t = \infty$ the value $V_0 = 2.3$ (with $a = 1.2$, $d = 1.1$).
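The limiting value quoted above follows from formula (5), $V = 2\sqrt{da}$, with the stated parameters. A minimal check (the function name `front_velocity` is hypothetical):

```python
import math


def front_velocity(d, a):
    """Asymptotic front velocity V = 2*sqrt(d*a) from formula (5)."""
    return 2.0 * math.sqrt(d * a)


print(round(front_velocity(1.1, 1.2), 2))  # 2.3
```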

4.2 Simulating 2D Propagating Front by Multilayer Method

The algorithm presented here combines the 2D BR-diffusion with the nonlinear
functions (3) and (6). BR-diffusion, which is referred to in [2] as a CA with
Margolus neighborhood, works in a two-step mode of execution over a cellular
array
$C = \{(u, (i, j, k)) : i = 0, \ldots, G - 1;\ j = 0, \ldots, H - 1;\ k = 0, \ldots, L\}$.
In the diffusion layers ($k = 1, \ldots, L$) two types of cells are
distinguished: even cells and odd cells. Even cells have both $i$ and $j$ even,
odd cells have both $i$ and $j$ odd. A cell from the even (odd) subset is
considered to be the central one of an even (odd) block in each $k$-th layer,
$B(i, j, k) = \{(u_1, (i, j, k)), (u_2, (i, j + 1, k)), (u_3, (i + 1, j + 1, k)), (u_4, (i + 1, j, k))\}$.
Clearly, the intersections between blocks from one and the same subset are
empty, from which the noncontradictoriness follows. To indicate the type of
blocks an additional context cell $(\beta, m_0)$ is introduced, $\beta = 0$,
$\beta = 1$ corresponding to even and odd cells, respectively.

Each diffusion iteration (Step 1) consists of two times: at the even time even
blocks turn by $\pi/2$ with probability $p = 1/2$ either clockwise or
counterclockwise. To indicate the rotation direction an additional context cell
$(\gamma, m_1)$ is introduced. If $\gamma = 1$ then the rotation is clockwise,
else counterclockwise. In PSA notation it looks like this:

$$\begin{aligned} \Theta_{11} : \{(1, m_0)(1, m_1)\} &* \{(u_1, (i, j, k)), (u_2, (i, j + 1, k)), (u_3, (i + 1, j + 1, k)), (u_4, (i + 1, j, k))\} \\ &\to \{(u_4, (i, j, k)), (u_1, (i, j + 1, k)), (u_2, (i + 1, j + 1, k)), (u_3, (i + 1, j, k))\}; \\ \Theta_{12} : \{(1, m_0)(0, m_1)\} &* \{(u_1, (i, j, k)), (u_2, (i, j + 1, k)), (u_3, (i + 1, j + 1, k)), (u_4, (i + 1, j, k))\} \\ &\to \{(u_2, (i, j, k)), (u_3, (i, j + 1, k)), (u_4, (i + 1, j + 1, k)), (u_1, (i + 1, j, k))\}. \end{aligned} \tag{28}$$

For the odd times the substitutions differ from (28) only by the context cell
$(\beta, m_0)$, which is in this case $(0, m_0)$. Other steps are in accordance
with the general scheme of the algorithm.
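One Block-Rotation iteration on a single Boolean layer can be sketched as follows. The sketch assumes a grid with even side lengths and periodic wrap for the odd blocks, and it labels the two rotation directions following the block cell order $(i, j)$, $(i, j+1)$, $(i+1, j+1)$, $(i+1, j)$ used in the substitutions above. The function names are hypothetical.

```python
import random


def rotate_block(u1, u2, u3, u4, clockwise):
    """New states of cells (i,j), (i,j+1), (i+1,j+1), (i+1,j) after one turn."""
    return (u4, u1, u2, u3) if clockwise else (u2, u3, u4, u1)


def block_rotation_half_step(grid, parity, rng):
    """Rotate every 2x2 block whose corner (i, j) has i % 2 == j % 2 == parity.

    Each block turns clockwise or counterclockwise with probability 1/2.
    Grid sides are assumed even; indices wrap periodically (an assumption).
    """
    rows, cols = len(grid), len(grid[0])
    for i in range(parity, rows, 2):
        for j in range(parity, cols, 2):
            i1, j1 = (i + 1) % rows, (j + 1) % cols
            old = (grid[i][j], grid[i][j1], grid[i1][j1], grid[i1][j])
            new = rotate_block(*old, clockwise=rng.random() < 0.5)
            grid[i][j], grid[i][j1], grid[i1][j1], grid[i1][j] = new
    return grid


def block_rotation_iteration(grid, rng):
    """One diffusion iteration: even blocks first, then odd blocks."""
    block_rotation_half_step(grid, 0, rng)
    block_rotation_half_step(grid, 1, rng)
    return grid
```

Blocks of the same parity do not overlap, so the in-place update within one half step is free of conflicts, and the number of ones in the layer is conserved.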

2D propagating front simulation has been done for the cellular array with
$G = H = 64$, $L = 32$ with two different nonlinear functions, given by (3) and
by (6). The initial array had the cell states $u = 1$ in the cells from
$\{(u, (i, j, k)) : (G/2 - g) < i < (G/2 + g);\ (H/2 - g) < j < (H/2 + g);\ k = 1, \ldots, L\}$,
$g = 6$; the rest of the cells had $u = 0$. Such initial states are referred to
as a flash in ecological research. In Fig. 4a two snapshots of the front
profile propagating from the flash are shown, the reaction function being of
the form (3) with $a = 1.2$. In Fig. 4b two snapshots are shown which are
obtained by simulating reaction-diffusion with the function $F(u)$ of the form
given by (6), having $F_{\max}(u)$ not sufficiently large to support the front
propagation. In this case the flash diminishes and disappears.

In Fig. 5 two fronts propagating towards each other are shown after $T = 16$
iterations (Fig. 5b), being initiated by two dense spots (Fig. 5a).
Fig. 4. Snapshots obtained by simulating 2D reaction-diffusion with F(u) of the form
(6): a) propagating front profiles, b) profiles of a damping flash

Fig. 5. The initial state (a) and the snapshot at T = 16 (b) of two propagating fronts



4.3 Simulating 1D Propagating Front by Three-Layer Method

Let us combine naive 2-step CA-diffusion with a reaction function of the form (3).
The array has three 1D layers: CR (k = 0), G]j (k = 1) and C , (k = 2).
One-dimensional 2-step naive diffusion is similar to the Block-Rotation diffusion.
The array is partitioned into two subsets: a subset of even cells, which form even
blocks $\{(u_1,(j,0)),(u_2,(j+1,0)) : j = 0,2,\ldots,N-2\}$, and a subset of odd ones
$\{(u_1,(j,0)),(u_2,(j+1,0)) : j = 1,3,\ldots,N-1\}$. Each diffusion iteration consists
of two times: at the even time the cells of each even block exchange states; the odd
blocks do the same at the odd time. The substitution of state exchange is as follows.

$$\theta_{13}: \{(u_1,(j,0)),(u_2,(j+1,0))\} \rightarrow \{(u_2,(j,0)),(u_1,(j+1,0))\}. \tag{29}$$
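A minimal Python sketch of the state exchange (29) on one 1D layer is given below. Two points are assumptions and not taken from the text: the boundary is not periodic, so the last odd block is skipped when it would reach past the array, and the parameter `p_exchange` (a hypothetical knob) allows a stochastic exchange; with the default value 1.0 every block exchanges its states, as the text literally states.

```python
import random

def naive_diffusion_iteration(cells, p_exchange=1.0, rng=random):
    """One 2-step iteration: even blocks exchange first, then odd blocks."""
    for start in (0, 1):          # 0: blocks at j = 0, 2, ...; 1: blocks at j = 1, 3, ...
        for j in range(start, len(cells) - 1, 2):
            if rng.random() < p_exchange:
                # Substitution (29): the two cells of the block swap their states.
                cells[j], cells[j + 1] = cells[j + 1], cells[j]
    return cells

print(naive_diffusion_iteration([1, 0, 0, 0]))
```

In the example the single particle moves from cell 0 to cell 1 at the even time and from cell 1 to cell 2 at the odd time; the number of particles is unchanged.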

The averaging (step 2) is performed using 6>g, neighborhood size being chosen 
according to the required accuracy. Reaction function (step 2 of the algorithms 
in section 4.3) is computed in cells of Gr according to 02- 

The allocation procedure is performed in the cells of Gr, taking into account
the corresponding probabilities computed by (22) and (23). In Fig. 6 two snapshots
of a propagating front obtained by this method are shown, the propagation speed
being in good accordance with the theoretical value.

Fig. 6. Two snapshots of propagating front profiles, obtained by stochastic method



5 Characterisation of the Proposed Method 

To draw a conclusion about the proposed method of simulating reaction-diffusion
phenomena, its computational characteristics (performance, accuracy, stability)
should be assessed. The most appropriate way to do this is to compare them with
the corresponding characteristics of PDE solution. Among the PDE solution methods
the finite-difference one is chosen, for the following two reasons. The first
is its fine-grained parallelism, which is considered a very important feature
due to its simplicity and capability of decomposition. The second is the similarity
of its simulation process to a CA evolution, which allows the comparison to be
considered correct.

At this stage of development of the proposed method only a qualitative comparison
is possible. A quantitative assessment may be made only after long and hard work,
both theoretical and experimental. Here only some considerations can be given
on determining the above characteristics. They are as follows.

Accuracy. The accuracy is determined by two types of errors: round-off
errors and approximation errors. Round-off errors are very small in the proposed
method, because the CA-diffusion is absolutely free of them, while averaging and
reaction function calculation may be done in integers. Approximation
errors emerge in the procedures of computing the reaction function, as well as in
the stochastic updating of the cell neighborhood. Since these errors depend on the size
of the averaging space, the price for accuracy is the size of the array.

Stability of the computation. According to the finite-difference PDE theory,
the stability of the iterative algorithm for computing the Laplace operator is
conditioned by a relationship among the diffusion coefficient and the spatial (h) and
time (τ) discretization steps. To meet these conditions the steps have to be chosen
sufficiently small, which results in a long computation time. When the proposed
CA-method is used, no restriction on computation stability is imposed.
The choice of discretization steps is made according to the required accuracy and
smoothness of the resulting dependencies (absence of “CA noise”).
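The stability restriction mentioned for the finite-difference approach can be illustrated with the standard explicit (forward-time, centred-space) scheme for 1D diffusion, which is stable when r = dτ/h² does not exceed 1/2, where d is the diffusion coefficient. The sketch below is not taken from the paper; the scheme, the periodic boundary and the function name are assumptions chosen for illustration.

```python
def ftcs_max_amplitude(r, n_cells=32, n_steps=200):
    """Run the explicit scheme u <- u + r * (second difference of u) on a periodic
    1D grid, starting from a single spike, and return the largest |u| at the end."""
    u = [0.0] * n_cells
    u[n_cells // 2] = 1.0
    for _ in range(n_steps):
        # u[x - 1] wraps around for x = 0; the right neighbour wraps via the modulo.
        u = [u[x] + r * (u[x - 1] - 2.0 * u[x] + u[(x + 1) % n_cells])
             for x in range(n_cells)]
    return max(abs(v) for v in u)

print(ftcs_max_amplitude(0.4))   # r <= 1/2: the spike spreads and decays
print(ftcs_max_amplitude(0.6))   # r > 1/2: high-frequency oscillations blow up
```

For a fixed diffusion coefficient, halving h therefore requires τ to be reduced by a factor of four, which is the source of the long computation time mentioned above.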

Performance, which is the inverse of the time needed to solve a given problem,
may be assessed by three time parameters: 1) the number of iterations needed to reach
the result, 2) the number of elementary operations in each iteration, and 3) the number
of cells (grid nodes in PDE) needed to provide the required properties. Each of
these parameters needs to be assessed taking into account many
factors: kind of nonlinearity, spatial dimension, number of variables, etc. At this
stage of the study the following remarks may be made. The number of iterations is
expected to be less than when the corresponding PDE is solved. As for the number
of elementary operations, the amount of bit operations should be computed and
compared to that needed for PDE solution. It is obvious that the amount of
bit operations is less for CA-diffusion in each iteration, but the averaging and
updating procedures may override the difference. As for the size of the array, on
the one hand it can be smaller than for the PDE solution because the value of h is
not restricted by stability conditions; on the other hand, it should be larger to
diminish the CA-noise.

The above considerations show that a correct comparison of the CA-model with
the finite-difference solution is a separate hard task, which may be solved on the
basis of a considerable body of experience. This task is most likely to be formulated
as follows: for a certain class of reaction-diffusion PDEs, the domain of parameters
should be found in which the proposed approach is preferable.



6 Conclusion 

Two methods for constructing a CA representation of a reaction-diffusion PDE are
presented, both using the approach of combining known CA diffusion models
with conventional computation of the nonlinear function. The methods are
illustrated by simulation on arrays of limited size. The simulation results coincide with
those known from the corresponding PDE analysis. Some considerations are given
on the assessment of the approach and on future investigations, among which
the most important is to promote applications and accumulate experience.
Abstract. The specification and verification of shared-memory multiprocessor
cache coherence protocols is a paradigmatic example of parallel
technologies where formal methods can be applied. In this paper we
present the specification and verification of a cache protocol and a set
of formalisms which are based on ‘process theory’. System correctness is
not established by simple techniques such as testing and simulation, but
‘ensured’ in terms of the underlying formalism. In order to manipulate
the specification and verify the properties we have used an automated
tool, namely the ‘Edinburgh Concurrency Workbench’ (CWB).



1 Introduction 

Formal methods are mathematically based techniques for specifying and verifying
complex hardware and software systems [3]. This paper emphasizes their
application to parallel processing and distributed computing systems, where
the main source of complexity is the co-existence of multiple, simultaneously
active, and interacting agents. The specification and verification of
shared-memory multiprocessor cache coherence protocols is a paradigmatic
example of parallel technologies where formal methods can be applied. Systems
of this kind are composed of a set of elements which need to be coordinated by
means of a reliable communication protocol.

In this paper we have chosen a cache coherence protocol as a working example
which will be developed through several stages of specification and verification.
Cache coherence protocols range from simple “snooping” ones to complex
“directory-based” frameworks [12]. As a first approximation to the
subject we will stick to one that belongs to the second group: the CC-NUMA
protocol. Although its description is relatively simple, it allows the definition of
non-trivial properties. The verification of these properties illustrates the
expressiveness and potential of the formalisms and how they could be applied to
more complex examples.

In order to deal with a formal specification and verification of the cache
coherence protocol, it is essential to use a mathematically based technique or
formal method. There are several formalisms that could be used to tackle this
problem. Any protocol can be successfully described in terms of processes
(concurrent agents which interact in accordance with a predefined pattern of
communication [4]) and consequently modelled by using a process-based formalism.
Specifically, we have used a process algebra, CCS [9], for the specification, and
an associated temporal logic, the μ-calculus [7,13], for the coherence verification.
Furthermore, for these formal methods an automated tool, the ‘Edinburgh
Concurrency Workbench’^ (CWB) [1], is available. Basically, it allows the definition,
manipulation and verification of processes and temporal properties.

This paper has been structured in four sections following this first intro- 
duction. Section 2 deals with process-oriented specification and verification of 
protocols; a justification of the chosen formalisms can be found here. Section 3 
describes the CC-NUMA cache coherence protocol. First of all, its specification 
in terms of communicating processes is presented; secondly, coherence restric- 
tions are defined in terms of temporal properties which are then automatically 
verified with regard to the previous specification. A brief summary and conclu- 
sions, as well as future lines of research, are discussed in Section 4. 



2 Specification and Verification of Protocols in CCS 

Cache coherence protocols (in general, any communication protocol) can be
described in terms of ‘objects’ [10] which operate concurrently and interact in
accordance with a predefined pattern of communication. This idea of communicating
objects fits with the concept of process, which allows the definition of an
observable behaviour by means of all possible communications.

There are different process theories, most of them with a notion of behaviour
pattern of objects [5] or machine for performing actions [4]. Moreover, some of
them are based on a well-founded underlying formalism and so are suitable to
be used to formally specify, manipulate and verify cache coherence protocols.
From all existing formal process theories we have chosen the process algebra
CCS (‘Calculus of Communicating Systems’ [9]) and the μ-calculus [14,13] as a
complementary temporal logic which allows the definition of properties to be
verified in relation to the specified processes (other formal methods can lead to
valid results as well). The main features are:

1. The CCS is a process theory which allows a formal manipulation of concur- 
rent communicating processes. 

2. There are higher-order extensions where dynamic structures communicating 
not only interaction channels but whole processes can be modelled. 

3. The CCS is complemented by a temporal logic, the μ-calculus, which
allows the definition of temporal properties. The process of verifying whether
a certain process satisfies a property can be automated.

4. An automated tool, the CWB [1], is available.



^ See http://www.dcs.ed.ac.uk/home/cwb/index.html (29/10/1999) 
3 The CC-NUMA Model

A cache-coherent shared-memory multiprocessor system gives its users
a logical view in which all the processors have access to a shared global
memory. This must be done in such a way that it is transparent to the user
that memory is physically distributed and that multiple copies of the same
data element may exist throughout the private caches of the processors.
Whereas private caches improve system performance, they introduce
the cache coherence problem: an update of a memory element
in any of the caches must be visible to all other processors in order to keep
the cached copies of the same data element consistent. This is achieved through
a coherence protocol, which is defined as a set of rules coordinating cache and
memory controllers [11,12].

Among different existing cache coherence protocols we have chosen the CC- 
NUMA protocol [11] as the running example of this paper. It considers three 
states for cached blocks: invalid, shared, dirty. Invalid is the initial state of all 
the caches and the one where the block is considered to have been invalidated. 
The state of shared means that the local copy is valid and can be read. Finally, 
the state will be dirty if the local copy is valid and it can be written (ownership 
of the copy). 

On the other hand, the shared memory can show one of the following three 
states: Uncached, (initial state) where a block copy is found at the memory and 
not at the caches; Shared, at least one cache holds a copy which can be read; 
Dirty, memory information is obsolete due to a previous cache modification. 

3.1 Algorithm 

Taking into account that caches can perform read, write and replacement operations,
the protocol algorithm (from the perspective of cache $C_i$) can be described
as follows:

1. Read Hit. No cache coherence actions need to be done. 

2. Read Miss. When the block is not cached, a memory request is to be issued. 
In case the copy ownership is held by another cache, the memory should be 
updated (shared) and a copy transmitted to the demanding cache. 

3. Write Hit. A write operation is issued and the cache memory has a copy. If 
the block is dirty, then no action is performed. If it is shared, an ownership 
request is issued. 

4. Write Miss. Similar to write hit but the block cannot be found in the cache 
and a memory request is issued. If there is a copy owner, the information 
should be updated. Subsequently, the block copy will be transmitted to the 
waiting cache, along with its ownership. If the block copy is shared by several 
caches, then the memory will invalidate all the copies, giving a valid copy to 
the one trying a write operation. Finally, if there is no valid copy, then the 
memory will provide it (and ownership). 

5. Replacement. Update action of the main memory with a cache copy. 
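The five cases above can be condensed into a small Python sketch that tracks only the three cache states (invalid, shared, dirty) of one block. It is a simplification under assumptions: each operation is treated as a single atomic step, the memory states and the request/reply messages are not modelled, and the names `access` and `is_coherent` are hypothetical.

```python
INVALID, SHARED, DIRTY = "invalid", "shared", "dirty"

def access(caches, i, op):
    """Return the cache states after cache i completes op ('read', 'write' or 'repl')."""
    new = list(caches)
    others = [c for c in range(len(caches)) if c != i]
    if op == "read":
        if new[i] == INVALID:               # read miss
            for c in others:
                if new[c] == DIRTY:         # the owner updates memory and keeps a readable copy
                    new[c] = SHARED
            new[i] = SHARED
        # read hit: no coherence action is needed
    elif op == "write":
        if new[i] != DIRTY:                 # write miss, or write hit on a shared copy
            for c in others:                # the owner or the sharers lose their copies
                new[c] = INVALID
            new[i] = DIRTY
        # write hit on a dirty copy: no action is needed
    elif op == "repl":
        new[i] = INVALID                    # replacement: memory is updated from the cache copy
    else:
        raise ValueError(op)
    return new

def is_coherent(caches):
    """At most one owner, and an owner excludes readable copies elsewhere."""
    return caches.count(DIRTY) <= 1 and not (DIRTY in caches and SHARED in caches)

state = [INVALID, INVALID, INVALID]
for cache, op in [(0, "write"), (1, "read"), (2, "write")]:
    state = access(state, cache, op)
    print(cache, op, state)
```

The short run at the end shows ownership moving from cache 0 to cache 2: the read by cache 1 downgrades the owner to shared, and the later write by cache 2 invalidates both readable copies.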
3.2 Specification

The CC-NUMA protocol has been modelled, using CCS, for the particular
case of 3 caches. This is due to the fact that we use a model checking technique
where the size of the system’s state space grows exponentially with the number
of subsystems. We have avoided the simplest examples (1-2 caches) but we have
also tried to limit the difficulty of our problem so as to be able to keep track
of all possible evolutions. Moreover, we present a well-structured specification,
and so it should not be difficult to increase the number of caches in our example.

Other techniques not attempted here permit verification of parameterised 
systems, i.e. to perform formal reasoning in terms of the number of caches [2,8]. 

Processors tell their caches which operation is to be performed next. The
protocol specification uses two different kinds of actions, visible and internal.
Visible actions are used for cache-processor communications (read, write, repl).
Every operation invocation is matched with an acknowledgement action (*g
ending), also visible. Internal actions are used for the rest of the communications.
This set of actions includes all the communications from caches to the memory
and vice versa.

From cache to memory: rmiss-req, request to the memory on a ‘read
miss’; wmiss-req, the same for a write operation; own-req, the cache holds a valid
copy and a write operation has been issued; inv-ack, acknowledgement reply
to a memory invalidation; wback, the cache has updated the memory copy.

From memory to cache: nack, operation denied; miss-reply, reply to a read
request; miss-reply-own, reply to a write request; own-reply, reply to an
own-req request; inv-req, invalidation request; wback-req, update request
to the copy owner; wback-req-own, update request to the copy owner, which
is required to go to the invalid state.

The following is the specification of the CC-NUMA protocol in CCS (CWB
notation). For the sake of brevity, we specify just one cache (Cache1) since all
of them share the same behaviour (instances of the same agent with different
communication channels).

*Inv1: block invalid
agent Inv1 = read1.PE-RO1 + write1.PE-RW-INV1 + repl1.Inv1 +
    inv-req1.'inv-ack1.Inv1 + wback-req1.Inv1 + wback-req-own1.Inv1;
*PE-RO1: cache is to comm. read operation to memory
agent PE-RO1 = 'rmiss-req1.P-RO1 + inv-req1.'inv-ack1.PE-RO1 +
    wback-req1.PE-RO1 + wback-req-own1.PE-RO1;
*PE-RW-INV1: cache is to comm. write operation to memory (invalid copy)
agent PE-RW-INV1 = 'wmiss-req1.P-RW-INV1 + inv-req1.'inv-ack1.PE-RW-INV1 +
    wback-req1.PE-RW-INV1 + wback-req-own1.PE-RW-INV1;
*RO1: valid copy that can be read
agent RO1 = read1.'readg1.RO1 + write1.PE-RW-VAL1 + repl1.Inv1 +
    inv-req1.'inv-ack1.Inv1;
*PE-RW-VAL1: cache is to comm. write operation to memory (valid copy)
agent PE-RW-VAL1 = 'own-req1.P-RW-VAL1 + inv-req1.'inv-ack1.PE-RW-INV1;
*RW1: valid copy that can be written
agent RW1 = read1.'readg1.RW1 + write1.'writeg1.RW1 + repl1.PE-Remp-RW1 +
    wback-req1.'update1.'wback1.RO1 + wback-req-own1.'update1.'wback1.Inv1;
agent PE-Remp-RW1 = 'update1.'wback1.Inv1 +
    wback-req1.'update1.'wback1.RO1 + wback-req-own1.'update1.'wback1.Inv1;
*P-RO1: cache waiting to complete read operation (read reply)
agent P-RO1 = miss-reply1.'readg1.RO1 + inv-req1.'inv-ack1.P-RO1 +
    wback-req1.P-RO1 + wback-req-own1.P-RO1 + nack1.PE-RO1;
*P-RW-INV1: cache waiting to complete write operation (invalid copy)
agent P-RW-INV1 = miss-reply-own1.'writeg1.RW1 +
    wback-req1.P-RW-INV1 + wback-req-own1.P-RW-INV1 +
    inv-req1.'inv-ack1.P-RW-INV1 + nack1.PE-RW-INV1;
*P-RW-VAL1: cache waiting to complete write operation (valid copy)
agent P-RW-VAL1 = own-reply1.'writeg1.RW1 + inv-req1.'inv-ack1.P-RW-INV1 +
    nack1.PE-RW-VAL1;

Fig. 1. CC-NUMA protocol (cache point of view)

Figure 1 shows the state transition diagram of the CC-NUMA model. Only 
internal actions have been represented, and so transitions are due to communi- 
cations between the caches and the memory. Whenever a transition is triggered 
by an input action and requires an output action, it has been denoted by input 
action / output action. 

The shared memory agent (see Fig. 2 for the state transition diagram) is 
defined as follows: 
*Uncached: no cache holds a valid copy
agent Uncached = rmiss-req1.'miss-reply1.Shared +
    rmiss-req2.'miss-reply2.Shared + rmiss-req3.'miss-reply3.Shared +
    wmiss-req1.'miss-reply-own1.Dirty1 + wmiss-req2.'miss-reply-own2.Dirty2 +
    wmiss-req3.'miss-reply-own3.Dirty3;
*Shared: at least one cache has a copy
agent Shared = rmiss-req1.'miss-reply1.Shared +
    rmiss-req2.'miss-reply2.Shared + rmiss-req3.'miss-reply3.Shared +
    wmiss-req1.'inv-req2.'inv-req3.Shd-Dirty-Miss1 +
    wmiss-req2.'inv-req1.'inv-req3.Shd-Dirty-Miss2 +
    wmiss-req3.'inv-req1.'inv-req2.Shd-Dirty-Miss3 +
    own-req1.'inv-req2.'inv-req3.Shd-Dirty-Own1 +
    own-req2.'inv-req1.'inv-req3.Shd-Dirty-Own2 +
    own-req3.'inv-req1.'inv-req2.Shd-Dirty-Own3;
*Shd-Dirty-Miss: from Shared, cache with invalid copy requests write
agent Shd-Dirty-Miss1 = inv-ack2.inv-ack3.'miss-reply-own1.Dirty1 +
    rmiss-req2.'nack2.Shd-Dirty-Miss1 + rmiss-req3.'nack3.Shd-Dirty-Miss1 +
    wmiss-req2.'nack2.Shd-Dirty-Miss1 + wmiss-req3.'nack3.Shd-Dirty-Miss1;
*Shd-Dirty-Own: from Shared with two copies, one tries write
agent Shd-Dirty-Own1 = inv-ack2.inv-ack3.'own-reply1.Dirty1 +
    rmiss-req2.'nack2.Shd-Dirty-Own1 + rmiss-req3.'nack3.Shd-Dirty-Own1 +
    wmiss-req2.'nack2.Shd-Dirty-Own1 + wmiss-req3.'nack3.Shd-Dirty-Own1;
*Dirty: the memory value is obsolete
agent Dirty1 = rmiss-req2.'wback-req1.Dirty-Shd2 +
    rmiss-req3.'wback-req1.Dirty-Shd3 +
    wmiss-req2.'wback-req-own1.Dirty-Dirty2 +
    wmiss-req3.'wback-req-own1.Dirty-Dirty3 + wback1.Uncached;
*Dirty-Shd: from Dirty, cache without valid copy issues read
agent Dirty-Shd1 = wback2.'miss-reply1.Shared +
    wback3.'miss-reply1.Shared +
    rmiss-req2.'nack2.Dirty-Shd1 + wmiss-req2.'nack2.Dirty-Shd1 +
    rmiss-req3.'nack3.Dirty-Shd1 + wmiss-req3.'nack3.Dirty-Shd1;
*Dirty-Dirty: from Dirty, cache without valid copy issues write
agent Dirty-Dirty1 = wback2.'miss-reply-own1.Dirty1 +
    wback3.'miss-reply-own1.Dirty1 + rmiss-req2.'nack2.Dirty-Dirty1 +
    wmiss-req2.'nack2.Dirty-Dirty1 + rmiss-req3.'nack3.Dirty-Dirty1 +
    wmiss-req3.'nack3.Dirty-Dirty1;

This is not the complete specification but a reduced version where all the
requests are issued by cache 1. The complete model includes the other
possibilities (Dirty2, Dirty3, Shd-Dirty-Miss2, Shd-Dirty-Miss3, Shd-Dirty-Own2,
Shd-Dirty-Own3, Dirty-Shd2, Dirty-Shd3, Dirty-Dirty2, Dirty-Dirty3), which
are constructed in a similar way.

Now it is possible to model the whole protocol as the concurrent composition
of all the agents (caches and memory), where all the internal actions have been
restricted so that no other ‘external’ agent can interfere with them. Initially, the
protocol is considered to have an uncached copy.

agent Protocol = (Inv1 | Inv2 | Inv3 | Uncached)\Inter;
Fig. 2. CC-NUMA protocol (memory point of view)



3.3 Verification of Coherence Properties 

Cache coherence is accepted as far as it can be verified in terms of data con- 
sistency (otherwise inconsistent data copies may be observed when a processor 
modifies that data copy in its private cache [12]). 

Traditionally, simple protocol correctness has been established by using tech- 
niques such as testing and simulations. However, the need for high performance 
and scalable machines has made coherence protocols much more complex so that 
new techniques are needed. Some of the most promising verifying techniques are 
based on formal methods and mechanical checking procedures (see [11] for a com- 
parison): (1) state enumeration of all possible agent interactions, (2) symbolic 
state model [11], where equivalence relations among global states are defined in 
order to represent a canonical state model, and (3) model checking [6], where it 
is verified whether a given state system is a model for a property specification 
or not. 

In this paper we have used a temporal model checking [3] technique. Model
checking verification is based on the possibility of assuring that a particular state
(agent with a certain state) satisfies a specification. This specification is,
precisely, the property to be verified. In our case, the property will represent data
consistency and the state will be the agent (or set of agents) implementing the
protocol. Data coherence properties are represented as modal-temporal properties.
This is due to the implicit temporal character of the coherence constraint:
“it is never possible to reach a state where an invalid read takes place”. The
temporal logic we use in this paper is the modal μ-calculus [7,13], which includes
temporal modalities that allow the description of, and reasoning about, time-varying
behaviour: how the truth values of assertions change over time. Indeed, the
modal μ-calculus is a very expressive propositional temporal logic which can
be used to describe safety, liveness, fairness and cyclic properties.

The problem of verifying whether a particular process satisfies a property or 
not is decidable (taking for granted that the total amount of possible states is 
finite). Nevertheless, the actual efficiency will depend on the number of possible 
states and the intrinsic complexity of the property. 
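The role of finiteness can be seen in a much simpler setting than full μ-calculus model checking: for a plain invariant (a safety property), exhaustive exploration of a finite state space always terminates and either finds a violating state or proves there is none. The sketch below illustrates only this special case and is not the algorithm used by the CWB; `find_violation` and the toy system are hypothetical.

```python
from collections import deque

def find_violation(initial, successors, is_bad):
    """Breadth-first search over a finite state space.
    Returns a shortest path from `initial` to a bad state, or None if none is reachable."""
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if is_bad(state):
            path = []
            while state is not None:        # walk back to the initial state
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in successors(state):
            if nxt not in parent:           # each state is visited at most once
                parent[nxt] = state
                queue.append(nxt)
    return None

# Toy system: a counter modulo 10 that advances by 2, so only even values are reachable.
step = lambda s: [(s + 2) % 10]
print(find_violation(0, step, lambda s: s == 4))
print(find_violation(0, step, lambda s: s == 5))
```

The cost of the search is proportional to the number of reachable states and transitions, which is why the number of possible states dominates the practical efficiency.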

3.4 Coherence Property 

Coherence means data consistency. This notion, excessively general to be
represented as such, can be described in terms of the agents taking part in the
protocol and the communication actions they can perform. Therefore, it is
possible to make a second description of coherence in the following terms: data is
consistent if whenever it is accessed by an agent, the value is precisely the last
that was assigned (by any processor). A third description characterizes coherence
as a “future-tense” specification: after a write operation of a data block (new
value), no later access will be possible unless a value update takes place. This
new property description can be considered to be equivalent to the previous ones
and more appropriate for proving that it is satisfied by the described protocol.

In order to keep the verification as simple as possible, it has been considered 
that one single data position is managed. For notation purposes, let the index i 
refer to the writing cache, and j, k to the other ones. 

The μ-calculus is a recursive propositional temporal logic including^:
propositional variables, Z; boolean connectives; modal operators, [] and <>
(immediate necessity and capability, respectively); and fixed-point operators,
max and min (greatest and least, respectively). An intuitive explanation of
fixed points is usually given in the following terms: min does not allow an
infinite behaviour without something (good?) eventually happening (liveness),
whereas max enforces something (not bad?) invariantly happening (safety).

The coherence property, as stated above, is specified by the following
μ-calculus formula:

prop Coherence_i = max(X. [-'writeg_i]X &
    ['writeg_i](max(Z. ['readg_{j,k}, 'writeg_{j,k}]F &
        [-'update_i]Z)))



For a full description of modal and temporal logics for processes, and the μ-calculus,
see [13].
1. The first greatest fixed-point (X) establishes that after any action other than
a write operation, the same (coherence) property has to be satisfied. This is
a safety component of coherence.

2. After the write operation, a second greatest fixed-point (Z) establishes that
(a) no other cache can complete any access, either read or write
(['readg_{j,k}, 'writeg_{j,k}]F), and (b) Z recursively continues to hold upon
doing any derivation other than update. The only way to “get out” of this
construction and allow other read/write operations is by means of a previous update.

3. Yet another safety construction could have been added to state that this
write-update schema holds onwards.
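Read over a single run of the protocol, the formula says that between the first completed write of cache i and the following update by cache i, no other cache completes a read or a write. The Python sketch below checks this on one recorded trace. It is an illustration only: the formula quantifies over all runs of the agent, the trace format (pairs of action name and cache index) is an assumption, and `violates_coherence` is a hypothetical name.

```python
def violates_coherence(trace, i):
    """Check one trace, given as (action, cache) pairs, against the property for writer i.
    Phase X: before the first 'writeg' of cache i. Phase Z: after it, until 'update' of i."""
    phase = "X"
    for action, cache in trace:
        if phase == "X":
            if (action, cache) == ("writeg", i):
                phase = "Z"
        else:
            if cache != i and action in ("readg", "writeg"):
                return True       # another cache completed an access before the update
            if (action, cache) == ("update", i):
                return False      # the formula as written constrains nothing further
    return False

print(violates_coherence([("writeg", 1), ("readg", 2)], i=1))
print(violates_coherence([("writeg", 1), ("update", 1), ("readg", 2)], i=1))
```

The early return after the update mirrors item 3 above: without the additional safety construction, later write and update rounds are not constrained by the formula.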

Verifying that the CC-NUMA protocol satisfies this coherence property has 
been accomplished using the CWB, an automated tool which caters for the 
manipulation and analysis of concurrent systems. In particular, the CWB allows 
for various equivalence, preorder and model checking using a variety of different 
process semantics. 

The number of states of a finite-state agent is a good indication of the model 
size. For the particular case of three caches, the CWB has found 1273 states. 
Other interesting commands give users the possibility of listing all the states, the 
transitions of an agent, running a simulation of an agent interactively, finding 
the derivatives of an agent via a given action, and many others. 

As far as model checking is concerned, it is possible to verify predefined
properties (deadlocked or livelocked states and traces leading to them) and
user-defined formulas, e.g., the coherence property stated above. Proving coherence
is defined in the CWB as a command that checks the proposition Coherence over
the agent Protocol (does the agent satisfy the formula?). This is answered using
a game-based model checking algorithm which generates a winning strategy for
the corresponding model checking game (see [13] for a complete description of
this technique).

The CWB execution of checkprop(Protocol, Coherence) gives a ‘true’ result,
and the corresponding winning strategy constitutes a proof of the coherence
property with respect to the specified CC-NUMA protocol. Indeed, there were
intermediate specifications which could not be proved correct due to different
modelling errors. The usefulness of the CWB lies in the possibility of playing
model checking games against the tool. Game traces help to understand why a
process does or does not satisfy a formula. This feature allowed us to detect the
modelling errors until we arrived at a correct specification.



4 Conclusions and Future Work 

In this paper we have described an application of formal methods to the
specification and verification of a cache coherence protocol. Specifically, the protocol
has been specified as a set of interacting processes in CCS and the coherence
constraints have been described as temporal propositions in the μ-calculus. The
results we have presented allow us to conclude that these formalisms are of
great interest and potential since: (1) the advantages of a formal specification
are evident, e.g., the lack of ambiguity and the possibility of automatic
manipulation; (2) automatic (tool-assisted) verification allows the user to propose
different properties and to check whether the protocol satisfies them or not.

We think that there are very promising research lines related to this subject.
It is clear that the specification and verification techniques used in this paper
could be used for more complex protocols. Theoretically, this is
possible no matter how sophisticated the protocol is. The only limitation is that
the number of possible states must be finite. In practical terms, we can make use of
powerful tools which allow us to manipulate specifications and to automatise the
verification of properties.

It would be interesting to make a systematic study of how different
consistency and coherence properties could be characterised in terms of temporal
propositions. The specification and verification techniques could
be applied not only to coherence protocols but to any other communication
protocol. Moreover, they could be applied to any “process-oriented” system, i.e.,
a system that can be accurately modelled as a set of communicating processes
that interact and exchange information.
Abstract. We introduce the νSPI-calculus that strengthens the notion
of “perfect symmetric cryptography” of the spi-calculus by taking time
into account. This involves defining an operational semantics, defining
a control flow analysis (CFA) in the form of a flow logic, and proving
semantic correctness. Our first result is that secrecy in the sense of Dolev-Yao
can be expressed in terms of the CFA. Our second result is that
also non-interference in the sense of Abadi can be expressed in terms of
the CFA; unlike Abadi we find the non-interference property to be an
extension of the Dolev-Yao property.

1 Introduction 

The widespread usage of distributed systems and networks has furnished a great 
number of interesting scenarios in which security plays a significant role. Well 
established and well founded process algebraic theories offer a fertile ground to 
express distributed and concurrent systems in pure form, and to study their 
properties. In particular, protocols and security protocols can be conveniently 
written in the spi-calculus [1,5], an extension of the π-calculus with primitives
for encryption and decryption. These are based on symmetric cryptography that 
is assumed to be perfect; as usual this is formulated in an algebraic manner: that 
encryption and decryption are inverses of one another. This facilitates expressing 
cryptographic protocols and one can reason on them exploiting the rich variety 
of techniques and tools, developed for calculi of computation and programming 
languages. 

As observed in [1] the notion of perfect encryption embodied in the
spi-calculus is too weak to guard against certain attacks based on comparing
ciphertexts. As an example, consider a process that first communicates true
encrypted under some key, then false encrypted under the same key, and finally
a secret boolean b encrypted under the same key; then confidentiality of b is
not guaranteed because its value can be obtained by comparison of ciphertexts.
To guard against this form of attack a type system is developed that enforces
placement of so-called confounders in all encryptions [1].

* The first two authors have been partially supported by the Progetti MURST TOSCA
and AI, TS & CFA.

By contrast our approach is based on the observation that many symmet- 
ric cryptosystems, e.g. DES operating in a suitable chaining mode, are always 
initialised with a random initialisation vector, thereby dealing with a notion of 
confounders dynamically. To be closer to real implementations, we therefore use 
a slight modification of the spi-calculus, called νSPI. We semantically model the
randomization of the encryption function, by adding to each plaintext M a new 
and fresh value r, making any two encryptions of M different from each other. 
In other words, we obtain a notion of history dependent cryptography. Recent 
and independent developments along a similar line of thought may be found 
in [20,3]. Indeed, it seems unlikely that any approach only based on algebraic 
identities (and consideration of the free theory generated) will be able to mimic 
our semantics-based development. 
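
The effect of history dependent encryption can be illustrated with a small sketch. It is not part of the calculus itself: the class `Enc`, the functions `encrypt` and `decrypt`, and the counter used to produce fresh confounders are all hypothetical names introduced here for illustration, and keys are simplified to plain strings.

```python
from dataclasses import dataclass
from itertools import count

# Hypothetical source of fresh confounder names r0, r1, ...
_fresh = count()


@dataclass(frozen=True)
class Enc:
    """An evaluated encryption: payload, confounder and key (all assumptions)."""
    payload: tuple
    confounder: str
    key: str


def encrypt(payload, key):
    # Every call draws a new confounder, so two encryptions of the
    # same payload under the same key are different values.
    return Enc(tuple(payload), f"r{next(_fresh)}", key)


def decrypt(value, key):
    # Decryption discards the confounder; a wrong key yields None
    # (the corresponding process would be stuck).
    if isinstance(value, Enc) and value.key == key:
        return value.payload
    return None


c1 = encrypt(("true",), "k")
c2 = encrypt(("true",), "k")
```

Here `c1` and `c2` compare as different even though both decrypt to the same payload, which is the property that defeats attacks based on comparing ciphertexts.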

In preparation for the applications to security we then develop in Section 3 a 
Control Flow Analysis (CFA) in the form of a Flow Logic [21]. Its specification is 
in line with previous developments for the π-calculus [8,7] and the same goes for
its semantic correctness by means of a subject-reduction result and the existence
of least solutions. However, the techniques needed for obtaining the least solution
in polynomial time (actually $O(n^3)$) are more involved than before because the
specification operates over an infinite universe [23,25]. 

Our first application to security in Section 4 is to show that CFA helps in 
showing that a protocol has no direct flows that violate confidentiality. The 
static condition, called confinement, merely inspects the CFA information to 
make sure that only public messages flow along public channels. The dynamic 
condition, called carefulness, then guarantees for all executions that no secrets 
are output on public channels. Correctness of the static analysis then follows 
from the subject-reduction result. This notion of security essentially says that 
no attacker, not even an active saboteur, can decipher a secret message sent on 
the network; actually, we show that if a process is careful then it preserves the 
secrecy of messages according to the notion originally advocated by Dolev and 
Yao [16,2]. A similar result has independently been achieved by [4] using a type 
system on a slightly different calculus. 

Our second application to security in Section 5 is to show that CFA also 
helps in checking that a protocol has no indirect flows that violate confidential- 
ity. In the formulation of Abadi [1] the static condition is formulated using a 
type system and the dynamic condition then compares executions using testing 
equivalence [13,10]. In our formulation the static condition, called invariance, 
is formulated as yet another check on the CFA information, and we retain the 
dynamic notion, which we prefer to call message independence. (Both our and
Abadi’s dynamic notions say that the active attacker cannot detect any
information whatsoever about the message sent, even by inspecting and changing
the behaviour of a secure protocol; but this does not quite amount to
non-interference in the sense of [17].) While inspired by [24, Section 5.3] this
represents the first use of CFA for establishing testing equivalence for
processes allowing cryptography. In our approach, confinement is a prerequisite
for invariance, thus suggesting a deeper connection between Dolev-Yao and
Non-Interference than reported by Abadi [2].

A more widely used alternative approach for calculi of computation and 
security is based on Type Systems [1,32,31,29,30,19,27,28,12,11]. Here secu- 
rity requirements are seen as static information about the objects of a system 
[1,32,12,11]. Our approach builds on the more “classical” approaches to static 
analysis and thus links up with the pioneering approach taken in the very early 
studies by Denning [14,15]; it also features a very good computational complex- 
ity. 

Because of lack of space, we dispense with the proofs, which often use tech- 
niques similar to those of [7] and that can in part be found in the extended 
version of the paper. 



2 History Dependent Cryptography 

Syntax. We define the νSPI-calculus by modifying the spi-calculus [5] (we
consider here its monadic form, for simplicity) so that the encryption primitive
becomes history dependent. Roughly, this amounts to saying that every time
we encrypt a message we get a different ciphertext, even if the message is the
same and the key is the same. We do so by changing the semantics: each
encryption necessarily generates a fresh confounder that is part of the message
(corresponding to the random initialisation vector used when running DES in
an appropriate chaining mode); therefore our analysis does not need to enforce
this property (unlike the type system in [1]). This naturally leads to modifying
the semantics to evaluate a message before it is actually sent; in other words
we define a call-by-value programming language. To aid the intuitions of the
reader familiar with the spi-calculus we have also changed the syntax by letting
each encryption contain a construct for generating the confounder; however this
syntactic change is in no way essential (quite unlike the semantic change).

The formulation of the CFA of the νSPI-calculus, in Section 3, is facilitated
by making a few assumptions. Mainly, we slightly extend the standard syntax by
mechanically assigning “labels” to the occurrences of terms; these are nothing
but explicit notations for program points and in an actual implementation can
be taken to be pointers into the syntax tree. Furthermore, to deal with the
α-renaming of bound names in a simple and “implicit” way, we assume that names
are “stable”, i.e. that each name $a$ is the canonical representative for its
class of α-convertible names. To this aim, we define the set of names
$\mathcal{N}$ as the disjoint union of sets of indexed names,
$\mathcal{N} = \biguplus_a \{a, a_0, a_1, \cdots\}$, and we write
$\lfloor a_i \rfloor = a$ for the canonical name $a$ associated to each actual
name $a_i$. Then we restrict α-conversion so that we only allow a name $a_i$ to
be substituted for the name $b_j$ if $\lfloor a_i \rfloor = \lfloor b_j \rfloor$.
In this way, we statically maintain the identity of names that may be lost by
freely applying α-conversions. (For a more “explicit” approach using marker
environments see [8,6].)

Definition 1. Let $l, l', l_i \in \mathcal{L}$ be labels, $n, m, \cdots \in \mathcal{N}$
be names and $x, y, \cdots \in \mathcal{V}$ be variables. Then (labelled)
expressions, $E, V \in \mathcal{E}$, (unlabelled) terms, $M, N \in \mathcal{M}$,
values $w, v \in Val'$, and processes, $P, P_i, Q, R, \cdots \in \mathcal{P}$,
are all built according to the following syntax:

$E, V ::= M^l$

$M, N ::= n \mid x \mid (E, E') \mid 0 \mid suc(E) \mid \{E_1, \cdots, E_k, (\nu r)\, r\}_{E_0}$

$w, v ::= n \mid pair(w, w') \mid 0 \mid suc(w) \mid enc\{w_1, \cdots, w_k, r\}_{w_0}$

$P, Q ::= 0 \mid \overline{E}\langle V\rangle.P \mid E(x).P \mid P \mid P \mid (\nu n)P \mid [E \text{ is } V]P \mid\ !P \mid let\ (x, y) = E\ in\ P \mid case\ E\ of\ 0 : P\ \ suc(x) : Q \mid case\ E\ of\ \{x_1, \cdots, x_k\}_V\ in\ P$

Here $E(x).P$ binds the variable $x$ in $P$, while $(\nu n)P$ binds the name $n$ in $P$.
We dispense with defining the standard notions of free and bound names (fn and
bn, resp.) and of free and bound variables (fv and bv, resp.). We often omit the
trailing 0 and write $\tilde{\cdot}$ to denote tuples of objects.

The νSPI-calculus slightly extends the spi-calculus, which in turn extends the
π-calculus (with which we assume the reader to be familiar) with more structured
terms (numbers, pairs and encryptions) and process constructs dealing with
them. Moreover, our term $\{E_1, \cdots, E_k, (\nu r)\, r\}_{E_0}$ represents the
unevaluated encryption of $E_1, \cdots, E_k$ under the symmetric key $E_0$. Its
evaluation results in the actual value $enc\{w_1, \cdots, w_k, r\}_{w_0}$, where
$w_i$ is the value of $E_i$ and the restriction $(\nu r)$ will make sure that the
confounder (or initialisation vector) $r$ is fresh (see below). The process
$case\ E\ of\ \{x_1, \cdots, x_k\}_V\ in\ P$ attempts to decrypt $E$ with the key
$V$: if $E$ is of the form $\{E_1, \cdots, E_k\}_V$ then the process behaves as
$P[E_i/x_i]$, otherwise the process is stuck. Similarly, $let\ (x, y) = E\ in\ P$
attempts to split the pair $E$ and $case\ E\ of\ 0 : P\ \ suc(x) : Q$ tries to
establish if $E$ is either 0 or a successor of some term.

Note that, unlike the π-calculus, names and variables are considered distinct.
Finally we extend $\lfloor \cdot \rfloor$ to operate on values by the
straightforward structural definition. We write $Val$ for the set of canonical
values, i.e. those values $v$ such that $\lfloor v \rfloor = v$.

Entities are considered equal whenever they are α-convertible; so $P = Q$
means that $P$ is α-convertible to $Q$. Substitution of terms, $\cdots[M/x]$, is
standard; substitution of expressions, $\cdots[E/x]$, really denotes substitution
of terms, so it preserves labels, hence $x^l[M^{l'}/x]$ is $M^l$; finally,
substitution of restricted values, $\cdots[(\nu r)w/x]$, acts as substitution of
values, $\cdots[w/x]$, with the restriction moved out of any expressions, e.g.
$\overline{n}\langle x\rangle[(\nu r)r/x] = (\nu r)\overline{n}\langle r\rangle$.
We shall write $P \equiv Q$ to mean that $P$ and $Q$ are equal except that
restriction operators may be placed differently as long as their effect is the
same, e.g. $(\nu r)\overline{n}\langle s\rangle.\overline{m}\langle r\rangle \equiv \overline{n}\langle s\rangle.(\nu r)\overline{m}\langle r\rangle$.

Semantics. The semantics is built out of three relations: the evaluation, the 
reduction and the commitment relations. In all of them we will apply our
disciplined α-conversion when needed. They all operate on closed entities, i.e. entities
without free variables. 
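Table 1. The semantics of νSPI: the evaluation relation, $\gg$; the reduction
relation, $>$; and the commitment relation, $\xrightarrow{\alpha}$ (without
symmetric rules).

Evaluation relation:

1. $n^l \gg n$

2. $0^l \gg 0$

3. $\dfrac{E_i \gg (\nu \tilde{r}_i)\, w_i,\ i = 1, 2,\quad \tilde{r}_1\tilde{r}_2 \text{ w.o. duplicates}}{(E_1, E_2)^l \gg (\nu \tilde{r}_1\tilde{r}_2)\, pair(w_1, w_2)}$

4. $\dfrac{E \gg (\nu \tilde{r})\, w}{suc(E)^l \gg (\nu \tilde{r})\, suc(w)}$

5. $\dfrac{E_i \gg (\nu \tilde{r}_i)\, w_i,\ i = 0, \cdots, k,\quad \tilde{r}_1 \cdots \tilde{r}_k \tilde{r}_0 r \text{ w.o. duplicates}}{\{E_1, \cdots, E_k, (\nu r)\, r\}^l_{E_0} \gg (\nu \tilde{r}_1 \cdots \tilde{r}_k \tilde{r}_0 r)\, enc\{w_1, \cdots, w_k, r\}_{w_0}}$

Reduction relation:

Match: $\dfrac{E_i \gg (\nu \tilde{r}_i)\, w_i,\ i = 1, 2,\quad (\nu \tilde{r}_1\tilde{r}_2)\, w_1 = (\nu \tilde{r}_1\tilde{r}_2)\, w_2,\quad \tilde{r}_1\tilde{r}_2\, fn(P) \text{ w.o. duplicates}}{[E_1 \text{ is } E_2]P > (\nu \tilde{r}_1\tilde{r}_2)\, P}$

Let: $\dfrac{E \gg (\nu \tilde{r})\, pair(w_1, w_2),\quad \tilde{r}\, fn(P) \text{ w.o. duplicates}}{let\ (x, y) = E\ in\ P > (\nu \tilde{r})\, P[w_1/x, w_2/y]}$

Zero: $\dfrac{E \gg 0}{case\ E\ of\ 0 : P\ \ suc(x) : Q > P}$

Rep: $!P > P \mid\ !P$

Suc: $\dfrac{E \gg (\nu \tilde{r})\, suc(w),\quad \tilde{r}\, fn(Q) \text{ w.o. duplicates}}{case\ E\ of\ 0 : P\ \ suc(x) : Q > (\nu \tilde{r})\, Q[w/x]}$

Enc: $\dfrac{E \gg (\nu \tilde{r}_0)\, enc\{w_1, \cdots, w_k, s\}_{w_0},\quad V \gg (\nu \tilde{r}_1)\, v,\quad \tilde{r}_0\tilde{r}_1\, fn(P) \text{ w.o. duplicates},\quad (\nu \tilde{r}_0\tilde{r}_1)\, w_0 = (\nu \tilde{r}_0\tilde{r}_1)\, v}{case\ E\ of\ \{x_1, \cdots, x_k\}_V\ in\ P > (\nu \tilde{r}_0)\, P[w_1/x_1, \cdots, w_k/x_k]}$

Commitment relation:

In: $m(x).P \xrightarrow{m} (x)P$

Out: $\dfrac{M^l \gg (\nu \tilde{r})\, w,\quad \tilde{r}\, fn(P) \text{ w.o. duplicates}}{\overline{m}\langle M^l\rangle.P \xrightarrow{\overline{m}} (\nu \tilde{r})\langle w^l\rangle P}$

Inter: $\dfrac{P \xrightarrow{m} F,\quad Q \xrightarrow{\overline{m}} C}{P \mid Q \xrightarrow{\tau} F@C}$

Par: $\dfrac{P \xrightarrow{\alpha} A}{P \mid Q \xrightarrow{\alpha} A \mid Q}$

Red: $\dfrac{P > Q,\quad Q \xrightarrow{\alpha} A}{P \xrightarrow{\alpha} A}$

Res: $\dfrac{P \xrightarrow{\alpha} A}{(\nu m)P \xrightarrow{\alpha} (\nu m)A}\quad \alpha \notin \{m, \overline{m}\}$

Congr: $\dfrac{P \equiv Q,\quad Q \xrightarrow{\alpha} A,\quad A \equiv B}{P \xrightarrow{\alpha} B}$
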

The evaluation relation $\gg$ in the upper part of Table 1 reduces an expression
$E$ to a value $w$. Although it is not part of the standard semantics of the
spi-calculus, it is quite natural from a programming language point of view, and
it is crucial in specifying history dependent encryption. As will become clear
soon, a term has to be fully evaluated before it is used either in a reduction
(e.g. when a matching or a decryption takes place) or as a message. So to speak,
our variant of the calculus is a call-by-value one. The central rule is that for
encryption: the restriction $(\nu r)$ acting on the confounder $r$ is pushed in
the outermost position, so that every other name in the process is and will be
different from $r$.

Two different occurrences, $M^l$ and $M^{l'}$ (with $l \neq l'$), of a term
containing an unevaluated encryption operator, never evaluate to the same
values, $(\nu \tilde{r})w$ and $(\nu \tilde{r}')w'$, where $w = w'$ and the
concatenation of vectors of names $\tilde{r}$ and $\tilde{r}'$ has no name
occurring more than once (abbreviated $\tilde{r}\tilde{r}'$ w.o. duplicates).

This is crucial for matching; indeed,
$[\{0, (\nu r)r\}^{l}_{E} \text{ is } \{0, (\nu r)r\}^{l'}_{E}]P$ never reduces
to $P$, because every time we encrypt 0, even if under exactly the same
evaluated key, we get a different value. The reduction rules in the central part
of Table 1 govern the evaluation of guards. They only differ from the standard
spi-calculus ones because, as mentioned above, the terms occurring in $P$ that
drive the reduction $P > Q$ have to be evaluated. This step may introduce some
new restricted names, in particular when the terms include an encryption to be
evaluated. This restriction is placed around $Q$, so as to make sure that the new
names are indeed fresh and that there will be no captures. The side condition
“$\tilde{r}_1\tilde{r}_2\, fn(P)$ w.o. duplicates” in the rule Match ensures that
the scopes are preserved even though the restrictions are placed differently;
similarly for the other side conditions. Finally, note that after a decryption,
the process $P$ has no access to the confounder $s$.

To define the commitment relation, we need the usual notions of abstraction
$F = (x)P$ and of concretion $C = (\nu \tilde{n})\langle w^l\rangle Q$ (assuming
that $(x)P \mid Q = (x)(P \mid Q)$, if $x \notin fv(Q)$, that
$(\nu \tilde{n})\langle w^l\rangle Q \mid R = (\nu \tilde{n})\langle w^l\rangle(Q \mid R)$,
if $\{\tilde{n}\} \cap fn(R) = \emptyset$, and the symmetric rules). Note that the
message sent must be an actual value. The interaction $F@C$ (and symmetrically
for $C@F$) is then the following, provided that
$\{\tilde{n}\} \cap fn(P) = \emptyset$:

$F@C = (\nu \tilde{n})(P[w/x] \mid Q)$

The structural operational semantic rules for the commitment relation are in
the lower part of Table 1; they are standard apart from rule Out that requires
the evaluation of the message sent, and introduces the new restricted names
$\tilde{r}$ (possibly causing also some α-conversions).

3 Control Flow Analysis (CFA) 

Writing $\widehat{Val} = \wp(Val)$, the result of our analysis for a process $P$
is a triple $(\rho, \kappa, \zeta)$, where:

- $\rho : \mathcal{V} \to \widehat{Val}$ is the abstract environment that
  associates variables with the values that they can be bound to; more
  precisely, $\rho(x)$ must include the set of values that $x$ could assume at
  run-time.

- $\kappa : \mathcal{N} \to \widehat{Val}$ is the abstract channel environment
  that associates canonical names with the values that can be communicated over
  them; more precisely, $\kappa(n)$ must include the set of values that can be
  communicated over the channel $n$.

- $\zeta : \mathcal{L} \to \widehat{Val}$ is the abstract cache that associates
  labels with the values that can arise there; more precisely $\zeta(l)$ must
  include the set of the possible actual values of the term labelled $l$.

Acceptability. To define the acceptability of a proposed estimate
$(\rho, \kappa, \zeta)$ we state a set of clauses operating upon flow logic
judgments of the forms $(\rho, \kappa, \zeta) \models M^l$ and
$(\rho, \kappa, \zeta) \models P$.




Static Analysis for Secrecy and Non-interference in Networks of Processes 






The analysis of expressions and of processes are in Table 2. Our rules make
use of canonical names and values and of the following abbreviations, where
$W \in \widehat{Val}$:

- $suc(W)$ for $\{suc(w) \mid w \in W\}$;

- $pair(W, W')$ for $\{pair(w, w') \mid w \in W, w' \in W'\}$;

- $enc\{W_1, \cdots, W_k, r\}_{W_0}$ for $\{enc\{w_1, \cdots, w_k, r\}_{w_0} \mid \forall i : w_i \in W_i\}$.

All the rules for validating a compound term or a process require that the
components are validated. The rules for an expression $M^l$ demand that
$\zeta(l)$ contains all the values associated with its components. Moreover, the
rule for output requires that the set of values associated with the message $N$
can be passed on each channel associated with $M$. Symmetrically, the rule for
input requires that each value passing along $M$ is contained in the set of
possible values of $x$, i.e. $\rho(x)$. The last three rules check the $i$-th
sub-components of each value associated with the expression to split, compare or
decrypt. Each sub-component must be contained in the corresponding $\rho(x_i)$.
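
The three abbreviations lift the value constructors to sets of values. A minimal sketch over finite sets follows; the tagged-tuple encoding of values and the function names `suc_set`, `pair_set` and `enc_set` are assumptions made for illustration and do not appear in the paper.

```python
from itertools import product


def suc_set(W):
    # suc(W) = { suc(w) | w in W }
    return {("suc", w) for w in W}


def pair_set(W, W2):
    # pair(W, W') = { pair(w, w') | w in W, w' in W' }
    return {("pair", w, w2) for w in W for w2 in W2}


def enc_set(Ws, r, W0):
    # enc{W1..Wk, r}_W0: one encryption per choice of payload
    # components and key, all sharing the confounder r.
    return {("enc", ws, r, w0) for ws in product(*Ws) for w0 in W0}


pairs = pair_set({"0", "n"}, {"m"})
encs = enc_set([{"0", "a"}, {"b"}], "r", {"k1", "k2"})
```

With two choices for the first payload component, one for the second and two keys, `encs` contains four abstract encryptions.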

Finally, the analysis is extended to concretions and abstractions in the last 
part of Table 2. 

Correctness. To establish the semantic correctness of our analysis we establish 
subject-reduction results for the evaluation, the reduction and the commitment 
relations of the previous section. 

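Theorem 1 (Subject Reduction for $\gg$, $>$ and $\xrightarrow{\alpha}$).

Let $M^l \in \mathcal{E}$; if $(\rho, \kappa, \zeta) \models M^l$ and
$M^l \gg (\nu \tilde{r})\, w$ then $\lfloor w \rfloor \in \zeta(l)$.

Let $P$ be a closed process such that $(\rho, \kappa, \zeta) \models P$;

(1) if $P > Q$ then $(\rho, \kappa, \zeta) \models Q$;

(2) if $P \xrightarrow{\tau} Q$ then $(\rho, \kappa, \zeta) \models Q$;

(3) if $P \xrightarrow{\overline{m}} (\nu \tilde{n})\langle w^l\rangle Q$ then $(\rho, \kappa, \zeta) \models (\nu \tilde{n})\langle w^l\rangle Q$ and $\zeta(l) \subseteq \kappa(\lfloor m \rfloor)$;

(4) if $P \xrightarrow{m} (\nu \tilde{n})(x)Q$ then $(\rho, \kappa, \zeta) \models (\nu \tilde{n})(x)Q$ and $\kappa(\lfloor m \rfloor) \subseteq \rho(x)$.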

Existence. So far we have only considered a procedure for validating whether or
not a proposed estimate $(\rho, \kappa, \zeta)$ is in fact acceptable. Now, we
show that there always exists a least choice of $(\rho, \kappa, \zeta)$
acceptable in the manner of Table 2.

It is quite standard to partially order the set of proposed estimates by setting
$(\rho, \kappa, \zeta) \sqsubseteq (\rho', \kappa', \zeta')$ if and only if
$\forall x \in \mathcal{V} : \rho(x) \subseteq \rho'(x)$,
$\forall n \in \mathcal{N} : \kappa(n) \subseteq \kappa'(n)$ and
$\forall l \in \mathcal{L} : \zeta(l) \subseteq \zeta'(l)$. Furthermore, a Moore
family $I$ is a set that contains $\sqcap J$ for all $J \subseteq I$, where
$\sqcap$ is the greatest lower bound operator (defined pointwise). One important
property of a Moore family is that it always contains a least element. The
following theorem then guarantees that there is always a least estimate to the
specification in Table 2. Its statement concerns processes and the proof relies
on analogous statements for expressions; this also holds for some of the
following results.
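
The pointwise ordering and greatest lower bound can be sketched for estimates with finite domains. This is only an illustration under stated assumptions: an estimate is modelled as a triple of dictionaries mapping variables, names and labels to finite sets, a missing key stands for the empty set, and the names `leq` and `meet` are hypothetical.

```python
def component_leq(f, g):
    # f is below g when f(k) is a subset of g(k) for every key k of f;
    # keys missing from g are read as the empty set.
    return all(vals <= g.get(k, set()) for k, vals in f.items())


def leq(e1, e2):
    # Pointwise ordering on triples (rho, kappa, zeta).
    return all(component_leq(f, g) for f, g in zip(e1, e2))


def component_meet(fs):
    # Pointwise intersection; only keys present everywhere can be non-empty.
    keys = set.intersection(*(set(f) for f in fs))
    return {k: set.intersection(*(set(f[k]) for f in fs)) for k in keys}


def meet(estimates):
    # Greatest lower bound of a non-empty family of estimates.
    return tuple(component_meet(fs) for fs in zip(*estimates))


e1 = ({"x": {"a", "b"}}, {"c": {"a"}}, {"l": {"a"}})
e2 = ({"x": {"b", "c"}}, {"c": {"a", "b"}}, {"l": {"a"}})
m = meet([e1, e2])
```

The meet `m` lies below both `e1` and `e2`. The meet of the empty family, which is the greatest estimate, is not representable in this finite encoding.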

Theorem 2. The set $\{(\rho, \kappa, \zeta) \mid (\rho, \kappa, \zeta) \models P\}$ is a Moore family for all $P$.
Table 2. CFA for expressions, processes, concretions and abstractions.




Polynomial Time Construction. In [7] we developed a polynomial time pro- 
cedure for calculating least solutions. This development does not immediately 
carry over because we now operate over an infinite universe of values due to the 
expressions present in the calculus. Therefore the specification in Table 2 needs 
to be interpreted as defining a regular tree grammar whose least solution can be 
computed in polynomial time. A recent result [25] in fact shows that the time 
complexity can be reduced to cubic time. 

4 CFA and Dolev-Yao Secrecy 

In this section, we extend to the νSPI-calculus the static property of confinement,
studied in [8] for the π-calculus. We then show that our notion corresponds to
that of Dolev and Yao [16,9,26,2].
The Dynamic Notion. The names, $\mathcal{N}$, are partitioned into the public
ones, $\mathcal{P}$, and the secret ones, $\mathcal{S}$, in such a way that
$n \in \mathcal{S}$ iff $\lfloor n \rfloor \in \mathcal{S}$. We demand
that the free names of processes under analysis are all public; it follows that the
secret names either do not occur at all or are restricted within a process. This
partition is used as a basis for partitioning (also non canonical) values according
to the two kinds $s$ (for secret) and $p$ (for public). The intention is that a single
“drop” of secret makes the entire value secret except for what is encrypted with
a secret key, which is anyway public. We do not consider confounders as they
are discarded by decryptions.



Definition 2. The operator $kind : Val' \to \{s, p\}$ is defined as

- $kind(n) = s$ if $n \in \mathcal{S}$, and $kind(n) = p$ if $n \in \mathcal{P}$;

- $kind(0) = p$;

- $kind(suc(w)) = kind(w)$;

- $kind(pair(w, w')) = s$ if $kind(w) = s \lor kind(w') = s$, and $p$ otherwise;

- $kind(enc\{w_1, \cdots, w_k, r\}_{w_0}) = p$ if $kind(w_0) = s \lor k = 0$, and $kind(\{w_1, \cdots, w_k\})$ otherwise,

where, by abuse of notation, $kind(W) = s$ if $\exists w \in W : kind(w) = s$,
and $kind(W) = p$ if $\forall w \in W : kind(w) = p$. We shall write $Val^p$ for
the set of canonical values of kind $p$.
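
The operator can be read as a structural recursion over values. The sketch below assumes a tagged-tuple encoding that is not part of the paper: names are strings, `("zero",)` is 0, and `("enc", payload, r, key)` is an encryption with a tuple payload; the function names and the explicit `secret` parameter are likewise illustrative.

```python
def kind(w, secret):
    """Return "s" or "p" for a value, given the set of secret names."""
    if isinstance(w, str):                      # a name
        return "s" if w in secret else "p"
    tag = w[0]
    if tag == "zero":
        return "p"
    if tag == "suc":
        return kind(w[1], secret)
    if tag == "pair":
        kinds = (kind(w[1], secret), kind(w[2], secret))
        return "s" if "s" in kinds else "p"
    if tag == "enc":
        _, payload, _confounder, key = w        # the confounder is ignored
        if kind(key, secret) == "s":
            return "p"                          # protected by a secret key
        return kind_set(payload, secret)
    raise ValueError(f"not a value: {w!r}")


def kind_set(W, secret):
    # One secret element makes the whole collection secret;
    # the empty collection is public.
    return "s" if any(kind(w, secret) == "s" for w in W) else "p"


SECRET = {"K", "M"}
```

For instance, a pair containing the secret name `M` is secret, while `M` encrypted under the secret key `K` is public.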



To define the dynamic notion of secrecy we write $P \to^* Q$ to mean that
$P \xrightarrow{\tau} \cdots \xrightarrow{\tau} Q$. Then carefulness means that
no secrets are sent in clear on public channels:

Definition 3. A process $P$ is careful w.r.t. $\mathcal{S}$ iff whenever
$P \to^* P' \xrightarrow{\tau} P''$, with the last step deduced with the premise
$R \xrightarrow{\overline{m}} (\nu \tilde{r})\langle w^l\rangle R'$, then
$m \in \mathcal{P}$ implies $kind(w) = p$.



The Static Notion. We now define the confinement property for the νSPI-calculus.
It predicts at compile time that a process is careful. A check suffices on the
$\kappa$ component of a solution: the set of values that can flow on each public
name $n$ must be all the ones that have kind $p$.

Definition 4. A process $P$ is confined w.r.t. $\mathcal{S}$ and
$(\rho, \kappa, \zeta)$ if and only if

$(\rho, \kappa, \zeta) \models P$ and $\forall n \in \mathcal{P} : \kappa(n) = Val^p$.

The subject reduction theorem extends trivially to confined processes thereby
paving the way for showing that the static notion implies the dynamic one.

Theorem 3. If $P$ is confined w.r.t. $\mathcal{S}$ then $P$ is careful w.r.t. $\mathcal{S}$.



Example 1. We consider here an adaptation of the Wide Mouthed Frog key 
exchange protocol as presented in [5]. The two processes A and B share keys 







C. Bodei et al. 



$K_{AS}$ and $K_{BS}$ with a trusted server $S$. In order to establish a secure
channel with $B$, $A$ sends a fresh key $K_{AB}$ encrypted with $K_{AS}$ to $S$.
Then, $S$ decrypts the key and forwards it to $B$, this time encrypted with
$K_{BS}$. Now $A$ can send a message $M$ encrypted with $K_{AB}$ to $B$ (for
simplicity, $M$ is a name). The analysis guarantees that $M$ is kept secret. The
protocol and its specification are as follows:

Message 1  $A \to S : \{K_{AB}\}_{K_{AS}}$
Message 2  $S \to B : \{K_{AB}\}_{K_{BS}}$
Message 3  $A \to B : \{M\}_{K_{AB}}$

$P = (\nu K_{AS})(\nu K_{BS})((A \mid B) \mid S)$

$A = (\nu K_{AB})(\overline{c_{AS}}\langle\{K_{AB}^{l_{K_{AB}}}, (\nu r_1)r_1\}_{K_{AS}}\rangle.\overline{c_{AB}}\langle\{M^{l_M}, (\nu r_2)r_2\}_{K_{AB}}\rangle)$

$S = c_{AS}(x).case\ x^{l_x}\ of\ \{s\}_{K_{AS}}\ in\ \overline{c_{BS}}\langle\{s^{l_s}, (\nu r_3)r_3\}_{K_{BS}}\rangle$

$B = c_{BS}(t).case\ t^{l_t}\ of\ \{y\}_{K_{BS}}\ in\ c_{AB}(z).case\ z^{l_z}\ of\ \{q\}_{y}\ in\ B'(q)$

Let $\mathcal{S} = \{K_{AS}, K_{BS}, K_{AB}, M\}$ and
$\mathcal{P} = \{c_{AS}, c_{BS}, c_{AB}\}$; the relevant part of an estimate for
$P$ (disregarding $B'(q)$) is:

$\rho(bv) = Val^p$ if $bv \in \{x, s, t, y, z, q\}$, and $\emptyset$ otherwise;

$\kappa(c) = Val^p$ if $c \in \{c_{AS}, c_{BS}, c_{AB}\}$, and $\emptyset$ otherwise.

Moreover, $\zeta(l_{bv}) = \rho(bv)$ for $bv \in \{x, s, t, y, z, q\}$ and
$\zeta(l_n) = \{n\}$ for all the names $n$ occurring in $P$. It is now easy to
check that $P$ is confined, hence the secrecy of $M$ is guaranteed. ■

The Formulation of Dolev and Yao. We now show that our notion of confinement
enforces secrecy in the sense of Dolev and Yao [16,9,26,2]. Its inductive definition
simulates the placement of a process $P$ in a hostile environment that initially
has some public knowledge, and thus knows all the values computable from
it. Then, the environment may increase its knowledge by communicating with
$P$. The secrecy requirement is that a message $M$ is never revealed by $P$ if the
environment cannot reconstruct $M$ from the initial knowledge and the knowledge
it has acquired by interacting with the process $P$.

In our case the initial knowledge is given by the numbers and by all the names 
that are not considered secret, among which those free in the process under 
consideration. In other words, we are interested in keeping secrets of honest 
parties, only. The values whose secrecy should be preserved are composed of 
at least a secret name, except for secret terms, when encrypted (and therefore 
protected) under a secret key. 

We first make precise which messages are computable from a given set of
canonical messages $W \subseteq Val$. The function
$C : \widehat{Val} \to \widehat{Val}$ is specified as the closure operator
(meaning that $C$ is idempotent and extensive: $C(C(W)) = C(W) \supseteq W$)
associated with the following inductive definition (where “iff” is short for a
rule with “if” and one with “only if”):

- $0 \in C(W)$;

- $W \subseteq C(W)$;

- $w \in C(W)$ iff $suc(w) \in C(W)$;

- $pair(w, w') \in C(W)$ iff $w \in C(W)$ and $w' \in C(W)$;

- if $\forall i : w_i \in C(W)$ then $\forall r \in W : enc\{w_1, \cdots, w_k, r\}_{w_0} \in C(W)$;

- if $enc\{w_1, \cdots, w_k, r\}_{w_0} \in C(W)$, $w_0 \in C(W)$ then $w_1, \cdots, w_k \in C(W)$.
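
Since $C(W)$ is infinite, it cannot be enumerated, but membership can be tested. The sketch below uses the common two-phase scheme (first decompose the known values to a fixed point, then check whether the target can be built from the result); this scheme is an assumption of the sketch and is not taken from the paper. Values use a hypothetical tagged-tuple encoding with names as strings, and the confounder of a synthesised encryption is required to be a known value.

```python
ZERO = ("zero",)


def synth(w, known):
    """Can w be built from the values in `known` by the composition rules?"""
    if w in known or w == ZERO:
        return True
    if isinstance(w, str):                       # an unknown name
        return False
    if w[0] == "suc":
        return synth(w[1], known)
    if w[0] == "pair":
        return synth(w[1], known) and synth(w[2], known)
    if w[0] == "enc":
        _, payload, r, key = w
        return (r in known and synth(key, known)
                and all(synth(p, known) for p in payload))
    return False


def analyse(W):
    """Saturate W under the decomposition rules (finite fixed point)."""
    known = set(W)
    changed = True
    while changed:
        changed = False
        for w in list(known):
            if isinstance(w, str):
                continue
            parts = ()
            if w[0] == "suc":
                parts = (w[1],)
            elif w[0] == "pair":
                parts = (w[1], w[2])
            elif w[0] == "enc" and synth(w[3], known):
                parts = w[1]                     # key derivable: open it
            for p in parts:
                if p not in known:
                    known.add(p)
                    changed = True
    return known


def derivable(w, W):
    return synth(w, analyse(W))


W = {"c", ("enc", ("K_AB",), "r1", "K_AS")}
```

With only `c` and the ciphertext known, `K_AB` is not derivable; once `K_AS` is added to the knowledge, the ciphertext can be opened and `K_AB` becomes derivable.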

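The following relation $\mathcal{R}$ (or $\mathcal{R}_{K_0,P_0}$ to be pedantic)
specifies how the environment, which knows a set of names $K_0$, can acquire some
additional knowledge by interacting with a process $P_0$:

- $\mathcal{R}(P_0, C(K_0))$;

- if $\mathcal{R}(P, W)$ and $P \xrightarrow{\tau} Q$ then $\mathcal{R}(Q, W)$;

- if $\mathcal{R}(P, W)$, $P \xrightarrow{m} (x)Q$, $\lfloor m \rfloor \in W$ and $\lfloor w \rfloor \in W$ then $\mathcal{R}(Q[w/x], W)$;

- if $\mathcal{R}(P, W)$, $P \xrightarrow{\overline{m}} (\nu \tilde{n})\langle w^l\rangle Q$ and $\lfloor m \rfloor \in W$ then $\mathcal{R}((\nu \tilde{n})Q, C(W \cup \{\lfloor w \rfloor\}))$.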

The notion of secrecy put forward by Dolev and Yao can now be phrased
as follows. (Recall that $fn(P_0) \subseteq \mathcal{P}$; $P_0$ is closed; and
that the names $\mathcal{N}$ are partitioned in $\mathcal{S}$ and $\mathcal{P}$.)

Definition 5. The process $P_0$ may reveal $M$ from $K_0 \subseteq \mathcal{P}$,
with $M \gg (\nu \tilde{r})\, w$ and $kind(w) = s$, if $\exists P', W'$ s.t.
$\mathcal{R}(P', W')$ and $\lfloor w \rfloor \in W'$.



The Comparison. Next, we consider the most powerful attacker or saboteur $S$,
and define the format of its estimate, which therefore will be an estimate for
any other attacker. From this estimate and one confining $P$, we can construct
an estimate showing that $P \mid S$ is also confined. In other words, $P$ can be
placed in any context without disclosing its secrets. Typically, the estimate
for $S$ will involve expressions of kind $p$, only. This and the following lemma
deeply depend on the Moore family property (Theorem 2). To state this succinctly
define the restrictions $\rho|_B$, $\kappa|_C$, $\zeta|_L$
($B \subseteq \mathcal{V}$, $C \subseteq \mathcal{N}$, $L \subseteq \mathcal{L}$)
as follows:

$(\rho|_B)(x) = \rho(x)$ if $x \in B$, and $\emptyset$ o.w.

$(\kappa|_C)(n) = \kappa(n)$ if $n \in C$, and $\emptyset$ o.w.

$(\zeta|_L)(l) = \zeta(l)$ if $l \in L$, and $\emptyset$ o.w.

We now characterize the shape of estimates for an attacker $Q$.

Lemma 1. Let $Q$ be a closed process with all names in $\mathcal{P}$; then
$(\rho', \kappa'|_{\mathcal{P}}, \zeta') \models Q$ where
$\forall x, \forall n \in \mathcal{P}, \forall l : \rho'(x) = \kappa'|_{\mathcal{P}}(n) = \zeta'(l) = Val^p$.

Given an estimate for $P$, we can reduce it to act on the variables and labels of
$P$ only.

Lemma 2. Let $B$ and $L$ be the sets of variables and labels in $P$, then
$(\rho, \kappa, \zeta) \models P$ if and only if
$(\rho|_B, \kappa, \zeta|_L) \models P$.

From the estimate confining $P$, we can construct an estimate confining
$P \mid Q$, using the above estimate for $Q$ and the above lemma.

Proposition 1. Let $P$ be confined w.r.t. $\mathcal{S}$; and let $Q$ be a closed
process with names all in $\mathcal{P}$, and such that all variables and labels
occurring inside $Q$ do not occur inside $P$. Then $P \mid Q$ is confined w.r.t.
$\mathcal{S}$.

Due to this proposition, there is no need to actually compute the estimate for
the most powerful attacker $S$ or for any attacker $Q$ and, more importantly,
the one for $P \mid S$: the estimate for $P$ as defined in Defn. 4 suffices for
checking secrecy.
Indeed, since P is confined, so is P | Q which also is careful, by Theorem 3. This 
suffices for proving that P never reveals secret messages to an attacker knowing 
only public data. 

It follows that our static notion of confinement suffices to guarantee Dolev 
and Yao’s property of secrecy. Indeed, a confined (and thus careful) process never 
sends secrets in clear on public channels. 

Theorem 4. A process $P$ confined w.r.t. $\mathcal{S}$ does not reveal any
message $M$, with $M \gg (\nu \tilde{r})\, w$ and $kind(w) = s$, from any
$K_0 \subseteq \mathcal{P}$.

5 CFA and Message-Independence 

The notion of secrecy seen above does not guarantee absence of implicit infor- 
mation flow, cf. [2] for more explanation. A typical case of implicit flow is when 
a protocol P behaves differently, according to the result of comparing a secret 
value against a public one. In this case, an attacker can detect some information 
about a message sent by noticing, e.g., that the message is not the number 0. 
In this section, we follow Abadi’s approach [1], and consider the case in which a 
message received does not influence the overall behaviour of the protocol, even 
in presence of an active attacker Q. Note however that Q running in parallel 
with P may change the behaviour of P, e.g. by sending a message that permits 
to pass a matching. We shall show that our CFA can guarantee this form of 
non-interference, that we call message independence. 

More precisely, we shall make sure that no attacker can detect whether a
process $P(x)$ (where for simplicity, $x$ is the only free variable) uses a
message $M$ or a different one $M'$ in place of $x$. To interface with the
developments of Section 4 we shall focus on a specific canonical channel
$n_* \in \mathcal{S}$ not otherwise used; it will be used to track the places
where the value of $x$ may reach. Technically, we can either assume that all
solutions $(\rho, \kappa, \zeta)$ considered have $\rho(x) = \{n_*\}$ or else
substitute $n_*$ for $x$ in all instances where we invoke the analysis and the
notion of confinement.

To cater for this development, we assign two sorts to values, according to
whether they contain $n_*$ or not (again, encryption is an exception).
Intuitively, a value $w$ has sort $I$ if either $n_*$ does not occur in $w$, or
it appears encrypted; otherwise $n_*$ is “visible” in $w$ that then gets sort
$E$. Also, note that if $kind(w) = p$ then $sort(w) = I$.

Definition 6. The operator $sort : Val' \to \{I, E\}$ is defined as

- $sort(n) = I$ if $\lfloor n \rfloor \neq n_*$, and $sort(n) = E$ if $\lfloor n \rfloor = n_*$;

- $sort(0) = I$;

- $sort(suc(w)) = sort(w)$;

- $sort(pair(w, w')) = I$ if $sort(w) = sort(w') = I$, and $E$ otherwise;

- $sort(enc\{w_1, \cdots, w_k, r\}_{w_0}) = I$.

Again, by abuse of notation, $sort(W) = E$ if $\exists w \in W : sort(w) = E$,
and $sort(W) = I$ if $\forall w \in W : sort(w) = I$.
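
As with kinds, sorts can be computed by structural recursion. The sketch below assumes a hypothetical tagged-tuple encoding of values (names as strings, `("zero",)` for 0), writes the tracked name as the string `"n*"`, and uses illustrative function names.

```python
N_STAR = "n*"   # stands for the special name tracking the value of x


def sort_of(w, n_star=N_STAR):
    """Return "E" if n_star is visible in w, and "I" otherwise."""
    if isinstance(w, str):                       # a name
        return "E" if w == n_star else "I"
    tag = w[0]
    if tag == "zero":
        return "I"
    if tag == "suc":
        return sort_of(w[1], n_star)
    if tag == "pair":
        both_invisible = (sort_of(w[1], n_star) == "I"
                          and sort_of(w[2], n_star) == "I")
        return "I" if both_invisible else "E"
    if tag == "enc":
        return "I"                               # encryption hides its content
    raise ValueError(f"not a value: {w!r}")


def sort_set(W, n_star=N_STAR):
    # A set is exposed as soon as one of its elements is.
    return "E" if any(sort_of(w, n_star) == "E" for w in W) else "I"
```

A pair that contains `n*` in clear has sort `E`, whereas an encryption whose payload is `n*` has sort `I`.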



With our next definition, we statically check if a process uses (the value that will
bind) $x$ in points where an attacker can grasp it. More in detail, we consider as
sensitive data those terms that are used as channels or as keys or in comparisons,
and check that they will never depend on the message $M$. Otherwise, in the first
case, the attacker may establish different communications with $P[M/x]$ and
$P[M'/x]$; in the second case, the attacker may decrypt a message if $M$ turns
out to be public; in the last case, the attacker may detect some information
about $M$ (e.g. if it is not 0, see above). The static check controls that the
special name $n_*$ never belongs to the sets of values that are associated by the
$\zeta$ component of estimates to each occurrence of these sensitive data. Note
that we allow decomposing a term containing $x$; we only forbid, in a lazy way,
that $x$ is used to alter the flow of control.



Definition 7. The process $P(x)$ is invariant w.r.t. $x$ and
$(\rho, \kappa, \zeta)$ if and only if all occurrences of

- terms $\{V_1, \cdots, V_k, (\nu r)r\}_{N^l}$ are s.t. $sort(\zeta(l)) = I$;

- prefixes $\overline{M^l}\langle V\rangle.P$ and $M^l(y).P$ and constructs
  $let\ (y, z) = M^l\ in\ P$; $case\ M^l\ of\ 0 : P\ \ suc(y) : Q$;
  $case\ M^l\ of\ \{y_1, \cdots, y_k\}_{N^{l'}}\ in\ P$ are s.t.
  $n_* \notin \zeta(l)$ and $sort(\zeta(l')) = I$;

- constructs $[M^l \text{ is } N^{l'}]P$ are s.t. $sort(\zeta(l)) = sort(\zeta(l')) = I$.

Before defining our notion of message independence we need to adapt testing
equivalence. Basically two processes are testing equivalent [13,10] if they pass
exactly the same set of tests, i.e. if one process is ready to communicate with
any partner then so is the other, and vice versa.

Definition 8. Let $P$, $P'$ and $Q$ be closed processes and let $\beta$ be $m$
or $\overline{m}$. The process $P$ passes a public test $(Q, \beta)$ if and only
if $fn(Q) \subseteq \mathcal{P}$ and
$(P \mid Q) \xrightarrow{\tau} Q_1 \cdots \xrightarrow{\tau} Q_n \xrightarrow{\beta} A$,
for some $n \geq 0$, some processes $Q_1, \cdots, Q_n$ and some agent $A$. The
two processes $P$ and $P'$ are public testing equivalent, in symbols
$P \simeq P'$, if $\forall (Q, \beta)$, if $P$ passes $(Q, \beta)$ then $P'$
passes $(Q, \beta)$ and vice versa.

Message independence of a process $P(x)$ then merely says that no external
observer can determine the term instantiating the variable $x$.

Definition 9. A process $P(x)$ is message independent iff
$P[M/x] \simeq P[M'/x]$ for all closed messages $M$ and $M'$.







C. Bodei et al. 



Finally, we establish that a confined and invariant process is message independent;
our formulation offers an alternative to Abadi’s approach, based on type
systems. Moreover, our formulation sheds light on the role played by confidentiality
in non-interference. It is crucial to keep secrets confidential so as not to expose,
either directly or indirectly, the values that can be bound to the free variable x.

Theorem 5. If P(x) is confined (w.r.t. S containing n_*) and invariant (w.r.t.
x and the same solution), then it is message independent.

6 Conclusion 

Control Flow Analysis has already been successfully used for studies of security
in the π-calculus [8] (focusing on direct flows violating confidentiality) and for
studies of mobility in the Mobile Ambients [22,18] (focusing on firewalls).

Here, we have proved that our overall approach to direct flows does scale up
to a calculus with perfect cryptography, despite the need to use more advanced
techniques for efficiently implementing the analysis. Prior to that, we have also
overcome a weakness in previous formulations of perfect symmetric cryptography,
usually formulated using algebraic identities, by defining its properties as
part of the semantics of the νSPI-calculus.

Our second technical result was to show that our approach is also amenable
to the treatment of indirect flows, in the form of non-interference results, thereby
obtaining results similar to those obtained using type systems. Indeed, we have
factored confidentiality out of non-interference. This separation of concerns may
clarify the relationship between the two properties and may help in checking them
separately.
Abstract. This paper presents a very simple consensus protocol that
converges in a single communication step in favorable circumstances.
Those situations occur when “enough” processes propose the same value.
(“Enough” means “at least (n - f)” where f is the maximum number of
processes that can crash in a set of n processes.) The protocol requires
f < n/3. It is shown that this requirement is necessary. Moreover, if
all the processes that propose a value do propose the same value, the
protocol always terminates in one communication step. It is also shown
that additional assumptions can help weaken the f < n/3 requirement
to f < n/2.

Keywords: Asynchronous Distributed System, Consensus, Crash Fail- 
ure, Message Passing. 



1 Introduction 

The Consensus problem is now recognized as being one of the most important 
problems to solve when one has to design or to implement reliable applications on 
top of an unreliable asynchronous distributed system. Informally, the Consensus 
problem is defined in the following way. Each process proposes a value, and all 
non-crashed processes have to agree on a common value which has to be one of 
the proposed values. The most important practical agreement problems (such as 
Atomic Broadcast, View Synchrony, Weak Atomic Commitment, Atomic Multicast,
etc.) can be reduced to Consensus, which can be seen as their greatest
common sub-problem¹. Consequently, a distributed module implementing Consensus
constitutes a basic building block on top of which solutions to practical
agreement problems can be built. This explains why Consensus is fundamental,
and justifies the great interest the literature has shown in it.

Solving the Consensus problem in asynchronous distributed systems is far 
from being a trivial task. In fact, it has been shown by Fischer, Lynch and 
Paterson [4] that there is no (deterministic) solution to this problem as soon 
as processes (even only one) may crash. Two major approaches have been pro- 
posed to circumvent this impossibility result. One lies in the use of randomized 
protocols [2]. The other lies in the unreliable failure detector concept, proposed 
and investigated by Chandra and Toueg [3]. Several failure detector-based con- 
sensus protocols have been designed ([11] presents a general approach to solve 

* This author is supported by a grant of the CNPq/Brazil #200323-97. 
the consensus problem in asynchronous systems equipped with Chandra-Toueg’s
failure detectors). Interestingly, a hybrid approach combining failure detectors
and random number generators has also been investigated [1,12].

To converge towards a single decided value, a consensus protocol makes the 
processes exchange proposed values. Each exchange constitutes a communication 
step. So, an interesting measure of the efficiency of a protocol is the number of 
communication steps it requires. In the best scenario, the consensus protocols 
proposed so far require that processes execute at least two communication steps. 

This paper presents a novel and surprisingly simple consensus protocol that
allows processes to decide in a single communication step when “enough” processes
propose the same value. “Enough” means at least (n - f), where n is
the number of processes and f is the maximum number of them that can be
faulty. This protocol requires f < n/3. Although failures do occur, they are rare
in practice. This observation shows that the f < n/3 requirement is not really
constraining. Moreover, it is shown that it is actually necessary when the initial
knowledge of processes is limited to n and f. The paper also shows that,
when the processes are initially supplied with more information, the f < n/3
requirement can be weakened to f < n/2.

2 System Model and Consensus 

Asynchronous System. The system model is patterned after the one described in
[3,4]. It consists of a finite set Π of n > 1 processes, namely, Π = {p_1, ..., p_n}.
A process can fail by crashing, i.e., by prematurely halting; a crashed process
does not recover. A process behaves correctly (i.e., according to its specification)
until it (possibly) crashes. By definition, a correct process is a process that does
not crash. A faulty process is a process that is not correct. As indicated in the
Introduction, f denotes the maximum number of processes that may crash.

Processes communicate and synchronize by broadcasting and receiving mes- 
sages through channels. Communication is reliable: there is no message creation, 
alteration, duplication or loss. If a process crashes while broadcasting a message 
m, only a subset of processes can receive m. There are assumptions neither on 
the relative speed of processes nor on message transfer delays. 

The Consensus Problem. In the Consensus problem, every process p_i proposes
a value v_i and all correct processes have to decide on some value v, in relation
to the set of proposed values. More precisely, the Consensus problem is defined
by the following three properties [3,4]:

— Termination: Every correct process eventually decides some value. 

— Validity: If a process decides v, then v was proposed by some process. 

— Agreement: No two processes (correct or not) decide differently. 

Additional assumption. Our aim is to provide a consensus protocol that terminates
in one communication step in good scenarios (i.e., when enough processes
do propose the same value), but also terminates in bad scenarios. So, we consider
that the underlying asynchronous distributed system allows the consensus
problem to be solved. More precisely, we assume it is equipped with a black box
solving the consensus problem, and we provide a protocol that decides in one
communication step in good scenarios and uses the underlying consensus protocol
in the other cases. A process p_i locally invokes it by calling
Underlying_Consensus(v_i), where v_i is the value it proposes.

3 The Protocol

Underlying Principle. The idea that underlies the design of the protocol is very
simple. It comes from the following observation: if all the processes initially
propose the same value, then this value is necessarily the decided value, whatever
the protocol and the system behavior. Hence, the proposed protocol executes a
first communication step during which the processes exchange the values they
propose. Then, each process checks whether all the processes have the same
initial value (actually, (n - f) identical values are sufficient). If it is the case,
this value is decided. If it is not, the underlying protocol is used.

The Protocol. The protocol is described in Figure 1. A process p_i starts a
Consensus execution by invoking Consensus(v_i). It terminates it when it executes the
statement return, which provides it with the decided value (at line 4, 7 or 9). To
prevent a process from blocking forever (i.e., waiting for a value from a process
that has already decided), a process that decides uses a reliable broadcast [3]
to disseminate its decision value. To this end, the Consensus function is made of
two tasks, namely, T1 and T2. T1 implements the core of the protocol. Line 4
and T2 implement the reliable broadcast.

Function Consensus(v_i)

Task T1:
(1) broadcast PROPOSED(v_i);
(2) wait until ((n - f) PROPOSED messages have been received);
(3) if (these messages carry the same estimate value v)
(4)    then broadcast DECISION(v); return(v)
(5)    else if ((n - 2f) PROPOSED messages carry the same value v)
(6)            then v_i ← v endif;
(7)         return(Underlying_Consensus(v_i))
(8) endif

Task T2:
(9) upon reception of DECISION(v): broadcast DECISION(v); return(v)

Fig. 1. The Consensus protocol

One Communication Step Decision. Let us consider the case where all the processes
that propose a value (those are the processes that have not initially
crashed) propose the same value. The protocol makes the processes that do
not crash decide in exactly one communication step².

If fewer than (n - f) processes propose the same value v, then the consensus
is solved by the Underlying_Consensus protocol. When there is a set of (n - f)
processes that propose the same value v, there are two cases according to the
set of PROPOSED messages received by a process p_i at line 2:

• Case 1: The (n - f) PROPOSED messages received by p_i carry v. It follows from
lines 2-4 that p_i decides after one communication step.

• Case 2: One of the (n - f) PROPOSED messages received by p_i carries a value
different from v. Let us notice that, as there are (n - f) PROPOSED messages
carrying v and 3f < n, it follows that necessarily p_i receives at least
(n - 2f) PROPOSED messages carrying v, and consequently adopts v at line 6. It
follows that when (n - f) processes propose the same value v, all the processes
that do not decide at line 4 invoke Underlying_Consensus with the same value v.
Interestingly, some consensus protocols expedite the decision when processes
propose the same value³.
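
The local test a process runs after the first exchange (lines 3-7 of Fig. 1) can be sketched as a pure function. This is only an illustration, not the authors' code: the name first_step, the tuple return convention and the use of a Python list for the received values are assumptions, and message passing, crashes and the underlying consensus black box are deliberately left out.

```python
from collections import Counter

def first_step(received, n, f, own_value):
    """Evaluate lines 3-7 of Fig. 1 on the (n - f) values received at line 2.

    Returns ("decide", v) when the process decides at line 4, or
    ("fallback", v) when it calls the underlying consensus with proposal v.
    """
    assert 3 * f < n, "the protocol requires f < n/3"
    assert len(received) == n - f, "line 2 waits for n - f PROPOSED messages"
    # Most frequent value; with f < n/3 at most one value can reach n - 2f copies.
    value, count = Counter(received).most_common(1)[0]
    if count == n - f:              # line 3: all messages carry the same value
        return ("decide", value)    # line 4
    if count >= n - 2 * f:          # line 5: n - 2f messages carry the same value
        return ("fallback", value)  # line 6 (adopt it), then line 7
    return ("fallback", own_value)  # line 7 with the unchanged proposal
```

With n = 4 and f = 1, three identical values correspond to Case 1, while two copies of v and one other value correspond to Case 2, where v is adopted before the fallback.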

4 Proof 

The proof of the Validity property (a decided value is a proposed value) is left 
to the reader. 

Theorem 1. (Termination) If a process p_i is correct, then it eventually decides.

Proof. As there are at least (n - f) correct processes, let us first note that
no correct process can block forever at line 2. Hence, they all execute line 3.
According to the results of the test there are two cases:

— A process decides at line 4.

In that case, this process has previously sent a DECISION message to all other
processes. Due to the reliable channel assumption, it follows that if a correct
process has not yet decided when it receives this message, it executes line 9
and consequently decides.

— No process decides at line 4.

In that case, all the processes that have not crashed during the first communication
step invoke the underlying consensus protocol. Due to its Termination
property, all the correct processes eventually decide.

□ Theorem 1



² It is important to notice that, in the same situation, the randomized protocols, the
failure detector-based protocols and the hybrid protocols presented in [1,2,3,11,12]
do not allow a one step decision.

³ In that case, [2] allows the processes to decide in two communication steps, while [12]
requires three steps. Due to the possibility of false suspicions, failure detector-based
protocols [3,11] do not enjoy this interesting property.
Theorem 2. (Agreement) Let f < n/3. No two processes decide differently.

Proof. Let us first notice that a process that decides at line 9 decides a value
v that has been sent by a process at line 4. So, we only consider the decision at
line 4 and line 7. The proof considers three cases.

— Let us first consider the case where two processes p_i and p_j decide at line
4. This means p_i received (n - f) PROPOSED messages carrying the same
value v. Similarly, p_j received (n - f) PROPOSED messages carrying the same
value w. Moreover, each process sends a single PROPOSED message to the
other processes. As f < n/3, we have (n - f) > n/2. It follows that at least
one PROPOSED(v) message and one PROPOSED(w) message have been sent
by the same process. Consequently v = w.

— If no process executes line 4, then the processes that decide execute line 7.
In that case, due to the Agreement property of the underlying consensus
protocol, they decide the same value.

— Let us now consider the case where some processes decide a value (say v)
at line 4, while other processes decide at line 7. We claim⁴ that the variable
v_j of any process p_j that executes line 7 has been previously set to v at
line 6. Then, all the processes that execute the underlying protocol propose
the same value v to this consensus. Due to the Validity property of the
underlying consensus, they can only decide v.

Proof of the claim. Let p_i be a process that executes line 4 and p_j be a
process that executes line 5. We have the following:

1. p_i received (n - f) PROPOSED(v) messages. Hence, no more than f PROPOSED
messages carry a value different from v.

2. p_j received (n - f) PROPOSED messages. Due to (1), at most f of them
carry a value different from v. (In the worst case, those f values are
equal.)

3. From (1) and (2) we conclude that at least (n - 2f) PROPOSED messages
received by p_j carry the value v.

4. As n > 3f, we have (n - 2f) > f. This means that the value v is a
majority value among the values received by p_j.

5. From the test done at line 5, we conclude that p_j updates v_j to v, which
concludes the proof of the claim.

□ Theorem 2
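
The counting argument in steps 1-4 of the claim can be checked mechanically for small systems. The sketch below is an illustration under stated assumptions (the function name claim_holds is hypothetical): it enumerates how many of the (n - f) messages received by p_j may carry a value other than v, and verifies that v always passes the test of line 5 while no other value can.

```python
def claim_holds(n, f):
    """Check steps 1-4 of the claim for given n and f.

    Assumes some process saw (n - f) copies of v, so at most f processes
    proposed another value (step 1).
    """
    threshold = n - 2 * f                      # test of line 5
    for others in range(min(f, n - f) + 1):    # non-v messages seen by p_j (step 2)
        v_copies = (n - f) - others            # step 3: at least n - 2f copies of v
        if v_copies < threshold:               # v would fail the test of line 5
            return False
        if others >= threshold:                # another value could pass it as well
            return False
    return True
```

Under these assumptions the check succeeds whenever 3f < n and fails at the boundary n = 3f, which is where step 4 breaks down.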



5 A Necessary Condition 

This section considers an asynchronous distributed system in which the consensus
problem can be solved. Let 𝒫 be the family of consensus protocols where the
global knowledge of a process p_i is the pair (n, f).

⁴ Using traditional terminology [3], this claim states how a value decided during the
first communication step is “locked”.




Consensus in One Communication Step 






Theorem 3. Let P ∈ 𝒫. If P allows processes to decide during the first communication
step, then f < n/3.

Proof. Let us first introduce the following parameters related to P ∈ 𝒫:

- ℓ: number of processes from which a process has to receive a value before
deciding after one communication step (note that ℓ ≤ (n - f), otherwise the
protocol could block forever).

- x: number of messages containing the same value v that allow a process p_i to
decide that value after the first communication step (note that x ≤ ℓ).

Let us observe that, as two processes that decide at the end of the first communication
step have to decide the same value, it is necessary that x > n/2. (If
this was not the case, p_i could decide v_1 because it received x copies of it, while
p_j could independently decide v_2 ≠ v_1 because it received x copies of it.)

The proof is by contradiction. Let us assume that P works in a system made
up of n = 3k processes with k ≤ f. The processes are partitioned into three
subsets G_1, G_2 and G_3 of size k. Combining 3k = n with x ≤ ℓ ≤ (n - f)
and x > n/2, we get k = n/3 < n/2 < x ≤ ℓ ≤ (n - f) ≤ (n - k) = 2k ≤ 2f.
From ℓ ≤ 2k, we deduce ℓ - k ≤ k < x. Hence, max(k, ℓ - k) < x. Let us consider
the following scenario.

— No process has initially crashed. The processes of G_1 propose v; the processes
of G_2 propose v; and the processes of G_3 propose w (≠ v).

— Each process p_i ∈ G_1 receives values from ℓ ≤ 2k ≤ 2f processes of G_1 and
G_2. As x ≤ ℓ, each process p_i ∈ G_1 receives enough copies of v to decide
(definition of x). So each process of G_1 decides v. Then, after having decided,
the processes of G_1 crash.

— Each process p_i ∈ G_2 ∪ G_3 receives values from ℓ ≤ 2k ≤ 2f processes of G_2
and G_3. More precisely, let us consider the scenario where:

- Each process of G_2 receives k copies of v and (ℓ - k) copies of w.

- Each process of G_3 receives (ℓ - k) copies of v and k copies of w.

From max(k, ℓ - k) < x, we conclude that no process p_i ∈ G_2 ∪ G_3 can decide.
It follows that any p_i ∈ G_2 ∪ G_3 neither decides nor is blocked during the
first communication step. Consequently, the processes of G_2 ∪ G_3 continue
executing P. Moreover, there is no way for them to know whether processes
of G_1 have decided. The subsets of processes G_2 and G_3 are symmetric with
respect to the number of copies of v and w they have. Hence, whatever P, the
processes of G_2 ∪ G_3 can indistinctly decide v or w. The Uniform Agreement
property is violated in all the runs of P that decide w.

This shows that there is no protocol P when n = 3k with k ≤ f. A contradiction.

□ Theorem 3
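
To make the scenario of the proof concrete, the following sketch instantiates it for n = 3k and f = k. The concrete choices ℓ = n - f and x = ⌊n/2⌋ + 1 are assumptions made only for illustration (the proof covers every admissible ℓ and x), and the function name theorem3_scenario is hypothetical.

```python
def theorem3_scenario(k):
    """Instantiate the scenario of the proof with n = 3k and f = k (f = n/3)."""
    n, f = 3 * k, k
    ell = n - f                    # assumed: largest admissible number of awaited values
    x = n // 2 + 1                 # assumed: smallest threshold satisfying x > n/2
    assert x <= ell
    g1_sees = {"v": ell, "w": 0}         # G1 hears only from G1 and G2
    g2_sees = {"v": k, "w": ell - k}     # G2 hears from G2 and G3
    g3_sees = {"v": ell - k, "w": k}     # G3 hears from G2 and G3

    def can_decide(seen):
        # Values for which the process holds at least x copies.
        return [value for value, copies in seen.items() if copies >= x]

    return can_decide(g1_sees), can_decide(g2_sees), can_decide(g3_sees)
```

For k = 2 (n = 6, f = 2, ℓ = 4, x = 4), the processes of G_1 may decide v while those of G_2 and G_3 see two copies of each value and cannot decide, which is the symmetric situation exploited in the proof.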



Corollary 1. The protocol presented in Section 3 is optimal with respect to the
number of process crashes that can be tolerated by the protocols of 𝒫.
Function Consensus(v_i)

Task T1: broadcast PROPOSED(v_i);
         wait until (PROPOSED messages received from a majority of processes);
         if (all the received values are equal to α)
            then broadcast DECISION(α); return(α)
            else if (α received from a process) then v_i ← α endif;
                 return(Underlying_Consensus(v_i))
         endif

Task T2: upon reception of DECISION(v): broadcast DECISION(v); return(v)

Fig. 2. Use of a privileged value (f < n/2)

Proof. The protocol presented in Section 3 trivially belongs to the family 𝒫.
The corollary follows directly from Theorem 3. □ Corollary 1



6 Considering Additional Assumptions 

This section shows that the previous necessary requirement can be weakened
when the system satisfies additional assumptions. Those assumptions basically
enrich the initial knowledge of processes; more precisely, they define an “a priori
agreement” among the processes. We give here two protocols that, with the help
of such additional assumptions, allow one step decision when f < n/2.

— In the first protocol the a priori agreement is “value oriented”: there is
a statically predetermined value that is decided when it is proposed by a
majority of processes. Hence, here, from an intuitive point of view, the values
that can be proposed do not have the same “power”.

— In the second protocol the a priori agreement is “control oriented”: there
is a statically predetermined majority set of processes explicitly used by the
protocol. Hence, here, from an intuitive point of view, the processes do not
all have the same “power”.



Existence of a Privileged Value. Let α be a predetermined value of the set of
values that can be proposed. Moreover, let us assume that α is initially known
by each process. The a priori knowledge of such a predetermined value can help
expedite the decision when f < n/2, as shown in Figure 2. The idea of the
protocol is very simple: a process is allowed to decide α in one communication
step as soon as it knows that α has been proposed by a majority of processes⁵.
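
The local rule of Task T1 in Figure 2 can be sketched in the same style. The function name first_step_privileged and the return convention are assumptions made for illustration; the broadcast of DECISION messages and the underlying consensus are not modelled.

```python
def first_step_privileged(received, n, alpha, own_value):
    """Task T1 of Fig. 2, applied to the values received from a majority of processes.

    `alpha` is the privileged value known a priori by every process.
    """
    assert 2 * len(received) > n, "values must come from a majority of processes"
    if all(value == alpha for value in received):
        return ("decide", alpha)       # one-step decision on the privileged value
    if alpha in received:
        return ("fallback", alpha)     # adopt alpha before the underlying consensus
    return ("fallback", own_value)     # alpha was not seen: keep the own proposal
```

Because any two majorities intersect, a process that does not decide in one step while another one decided α necessarily sees at least one copy of α and therefore proposes α to the underlying consensus.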

Predefined Set of Processes. Let us assume that there is a predefined set of 
processes S that is initially known by each process. The protocol described in 

⁵ When consensus is used as a sub-protocol to solve the atomic commit problem (see
[6]), COMMIT can be considered as privileged with respect to ABORT.
Function Consensus(v_i)

Task T1: broadcast PROPOSED(v_i);
         wait until (PROPOSED messages received from (n - f) processes);
         if (the same value v has been received from each process ∈ S)
            then broadcast DECISION(v); return(v)
            else v_i ← a value from a process ∈ S;
                 return(Underlying_Consensus(v_i))
         endif

Task T2: upon reception of DECISION(v): broadcast DECISION(v); return(v)

Fig. 3. Predefined set of processes (f < n/2)



Figure 3 uses this a priori knowledge to decide in one communication step when
all the processes of S propose the same value. It requires f < n/2 < |S|. In this
solution, the processes are no longer anonymous: their identities are used by the
protocol.
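
A sketch of the local rule of Figure 3 follows. Several points are assumptions for illustration only: the received messages are modelled as a dictionary from process identity to proposed value, the function name first_step_with_set is hypothetical, and when several members of S were heard the value of the smallest identity is adopted (the figure only says "a value from a process ∈ S").

```python
def first_step_with_set(received, n, f, S, own_value):
    """Task T1 of Fig. 3; `received` maps a process identity to its proposed value."""
    assert 2 * f < n < 2 * len(S), "Fig. 3 requires f < n/2 < |S|"
    assert len(received) == n - f, "the process waits for n - f PROPOSED messages"
    from_S = {p: v for p, v in received.items() if p in S}
    heard_all_of_S = set(from_S) == set(S)
    if heard_all_of_S and len(set(from_S.values())) == 1:
        return ("decide", next(iter(from_S.values())))
    if from_S:
        # Assumed tie-break: adopt the value of the smallest identity in S.
        return ("fallback", from_S[min(from_S)])
    return ("fallback", own_value)   # defensive: unreachable when |S| > f
```

Since |S| > n/2 > f, at least one of the (n - f) messages comes from a member of S, so the adoption branch always has a value to pick.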



7 Concluding Remark 

This paper has presented a consensus protocol that makes the processes decide
in one communication step when the processes that propose a value propose the
same value. It has been shown that this protocol requires f < n/3 and that this
requirement is necessary. It has also been shown how additional assumptions
allow the f < n/3 requirement to be weakened.

As noted in the Introduction, in practice failures occur but are rare. Moreover,
in some practical agreement problems, processes usually propose the same
value. This is the case of the atomic commitment problem where, nearly always,
the processes do propose COMMIT [5]. A reduction of atomic commitment to
consensus is described in [6] (this reduction involves a preliminary message exchange
to allow each process to transform the votes it receives into a COMMIT/ABORT
proposal). The proposed consensus protocol is particularly attractive for solving
these agreement problems.

Very recently, a new and promising condition-based approach has been proposed
to solve the consensus problem [7]. It consists in identifying sets of input
vectors for which it is possible to design a consensus protocol that works despite
up to f faults. Such conditions actually define a strict hierarchy [8]. The efficiency
of the associated condition-based protocols is investigated in [10]. Moreover, this
approach proves to be very general, as it allows more general agreement
problems to be solved [9].
Abstract. In this paper, we describe an approach for the optimization
of dedicated co-processors that are implemented either in hardware
(ASIC) or configware (FPGA). Such massively parallel co-processors are
typically part of a heterogeneous hardware/software-system. Each co-processor
is a massively parallel system consisting of an array of processing
elements (PEs). In order to decide whether to map a computationally intensive
task into hardware, existing approaches either try to optimize for
performance or for cost with the other objective being a secondary goal.
Our approach presented here, instead, a) considers multiple objectives
simultaneously. For a given specification, we explore space-time-mappings
leading to different degrees of parallelism and cost, and different optimal
hardware solutions. b) We show that the hardware cost may be efficiently
determined in terms of the chosen space-time mapping by using state-of-the-art
techniques in polyhedral theory. c) Finally, we introduce ideas
to drastically reduce dimension and size of the search space of mapping
candidates. d) The feasibility of our approach is shown for two realistic
examples.



1 Introduction 

Technical analysts foresee the dilemma of not being able to focus next generation
hardware complexity because of a lack of mapping tools. On the other hand, the
next generation of ULSI chips will allow arrays of 10 x 10 32-bit
microprocessors and more to be implemented on a single die. Hence, parallelization
techniques and compilers will be of utmost importance in order to map
computationally intensive algorithms efficiently to these processor arrays.

Through this advance in technology, reconfigurable hardware such as FPGAs
(field-programmable gate arrays) [8], sometimes also called configware, is also
becoming more and more attractive for co-processors, for the following three reasons:
1) Chips with up to 10 million gate counts allow arithmetic co-processors
with hundreds of processing elements to be implemented, e.g., for image processing and
linear algebra algorithms, see, e.g., Fig. 1. Shown is an FPGA placement visualized
by the tool BoardScope by Xilinx [16] with a square array of processing

* Supported in part by the German Science Foundation (DFG) Project SFB 376
“Massively Parallel Computation”.



V. Malyshkin (Ed.): PaCT 2001, LNCS 2127, pp. 51-65, 2001. 
Fig. 1. Heterogeneous application, architecture, and hardware/software partition
including a massively parallel co-processor implemented in hardware (ASIC) or
configware (FPGA).



elements (PEs), each consisting of an array-multiplier, an adder, registers, and
some control logic. 2) Configware has the major advantage of being able to reuse
silicon for time-variant co-processor functions by means of reconfiguration. 3)
Support for regular designs: standards such as the Java API JBits [16] allow
the regular design to be specified within Java loops such that lower-level mapping may
be accomplished efficiently and independently of the problem size.

In the eighties and early nineties, higher-level mapping techniques for so-called
systolic arrays were in vogue. They mostly dealt with the
problem of mapping a certain algorithm specified by a loop program onto a
parallel processor array such as a systolic array, and architectural extensions
thereof with time-dependent and control-dependent processor functions [12]. By
the use of linear space-time mappings, the relationship between a regular array of
communicating PEs and the temporal execution of operations of loop algorithms
has been described. Unfortunately, dedicated hardware chips proposed for certain
algorithms were too rigid, implementing just a single problem, or too slow and
expensive due to long time-to-market.

With the above-mentioned advances of silicon technology, and the advent
of configware, the necessity of mapping tools for parallel hardware processors
has been rethought and its application scope and processor capabilities broadened.
Some important recent approaches include the PICO-N system by Hewlett-Packard
[11] that specifies a methodology for synthesizing an array of customized
VLIW processors starting with a loop program with uniform data dependencies
and VHDL code at the RTL-level. From a given irregular program, parts are
automatically extracted, mapped to hardware, and finally, the specification is
modified to make use of this accelerator. Another approach that embeds regular
array design into heterogeneous hardware/software targets is the Compaan system
[4]. There, Matlab applications are transformed into a network of sequential,
communicating processes where each process is responsible for computing some
variables of a nested loop program.

In this realm, our paper deals with the specific problem of exploring
cost/performance tradeoffs when mapping a certain class of loop-specified computations
called piecewise regular algorithms [12] onto a dedicated co-processor. The main
new ideas of our approach are summarized as follows:

— Simultaneous consideration of multiple objectives: For a given piecewise regular
algorithm, we explore space-time-mappings¹ leading to different degrees
of parallelism and cost, and different optimal hardware solutions. Existing
approaches such as [3] consider solutions that find a schedule first
(time-mapping) so as to minimize latency, and minimize cost as a secondary goal,
or the other way round. Such design points are not necessarily so-called
Pareto-optimal [9] points.

— Efficient computation of objectives: We show that hardware cost may be 
efficiently determined in terms of the chosen space-time mapping by using 
state-of-the-art techniques in polyhedral theory. 

— Search space reduction: We introduce several ideas to drastically reduce di- 
mension and size of the search space of mapping candidates. 

The rest of the paper is structured as follows. Section 2 introduces the class of 
algorithms we are dealing with. In Section 3, the exploration algorithm for finding 
Pareto-optimal space-time mappings is given. There, the objective functions for 
cost and performance (latency) are explained including the reduction of the 
search space. Finally, results are presented in Section 4. 

2 Notation and Background 

2.1 Algorithms 

In this paper the class of algorithms we are dealing with is a class of recurrence 
equations defined as follows: 

Definition 1. (Piecewise Regular Algorithm). A piecewise regular algorithm
contains $N$ quantified equations. Each equation $S_i[I]$ is of the form

$$x_i[I] = f_i(\ldots, x_j[I - d_{ji}], \ldots)$$

where $I \in \mathcal{I}_i \subseteq \mathbb{Z}^n$, $x_i[I]$ are indexed variables, $f_i$ are arbitrary functions,
$d_{ji} \in \mathbb{Z}^n$ are constant data dependence vectors, and $\ldots$ denote similar arguments.

¹ Although we are also able to handle more general classes of algorithms and mappings,
introducing them here would unnecessarily complicate the notation and hinder the
presentation of the main ideas of the exploration approach.
The domains $\mathcal{I}_i$ are called index spaces, and in our case defined as follows:

Definition 2. (Linearly Bounded Lattice). A linearly bounded lattice denotes an
index space of the form

$$\mathcal{I} = \{ I \mid I = M\kappa + c \,\wedge\, A\kappa \geq b \}$$

where $\kappa \in \mathbb{Z}^l$, $M \in \mathbb{Z}^{n \times l}$, $c \in \mathbb{Z}^n$, $A \in \mathbb{Z}^{m \times l}$ and $b \in \mathbb{Z}^m$.
$\{\kappa \in \mathbb{Z}^l \mid A\kappa \geq b\}$ defines an integral convex polyhedron or, in case of
boundedness, a polytope in $\mathbb{Z}^l$. This set is affinely mapped onto iteration vectors $I$
using an affine transformation ($I = M\kappa + c$).

Throughout the paper, we assume that the matrix $M$ is square and of full rank.
Then, each vector $\kappa$ is uniquely mapped to an index point $I$. Furthermore, we
require that the index space is bounded.
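
As an illustration of Definition 2, the following sketch enumerates the index points of a small linearly bounded lattice by brute force. The function name lattice_points, the explicit search box for κ and the example matrices are assumptions introduced here; they are not taken from the paper.

```python
from itertools import product

def lattice_points(M, c, A, b, kappa_box):
    """Enumerate I = M*kappa + c for all integer kappa with A*kappa >= b.

    `kappa_box` gives an inclusive (low, high) range per component of kappa;
    it is an assumed search box that must contain the bounded polytope.
    """
    def matvec(mat, vec):
        return [sum(a * x for a, x in zip(row, vec)) for row in mat]

    points = []
    for kappa in product(*(range(lo, hi + 1) for lo, hi in kappa_box)):
        if all(lhs >= rhs for lhs, rhs in zip(matvec(A, kappa), b)):
            points.append(tuple(m + ci for m, ci in zip(matvec(M, kappa), c)))
    return points

# Hypothetical example: 0 <= kappa_1 <= 1 and 0 <= kappa_2 <= 2, M square and of full rank.
M = [[2, 0], [0, 1]]
c = [1, 0]
A = [[1, 0], [-1, 0], [0, 1], [0, -1]]
b = [0, -1, 0, -2]
index_space = lattice_points(M, c, A, b, [(-3, 3), (-3, 3)])
```

Because M is square and of full rank in this example, the six admissible vectors κ map to six distinct index points.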

For illustration purposes throughout the paper, the following simple example 
is used. 

Example 1. Consider a piecewise regular algorithm which consists of three quantified
indexed equations

$$a[i,j] = f(a[i-1,j]), \quad \forall (i\ j)^T = I \in \mathcal{I}$$
$$b[i,j] = g(b[i,j-1]), \quad \forall (i\ j)^T = I \in \mathcal{I}$$
$$c[i,j] = a[i,j] \;\mathrm{op}\; b[i,j], \quad \forall (i\ j)^T = I \in \mathcal{I}.$$

The data dependence vectors are $d_{aa} = (1\ 0)^T$, $d_{bb} = (0\ 1)^T$, $d_{ac} = (0\ 0)^T$, and
$d_{bc} = (0\ 0)^T$. The index space is given by
Computations of piecewise regular algorithms may be represented by a dependence
graph (DG). The DG of the algorithm of Example 1 is shown in Fig. 3(a).
The DG expresses the partial order between the operations. Each variable of the
algorithm is represented at every index point $I \in \mathcal{I}$ by one node. The edges
correspond to the data dependencies of the algorithm. They are regular throughout
the algorithm, i.e. a[i,j] is directly dependent on a[i-1,j]. The DG specifies
implicitly all legal execution orderings of operations: if there is a directed path in
the DG from one node a[J] to a node c[K] where $J, K \in \mathcal{I}$, then the computation
of a[J] must precede the computation of c[K].
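
The dependence graph of Example 1 can be built explicitly for a small index space, and one legal execution ordering obtained by topological sorting. The rectangular index space 0 ≤ i < rows, 0 ≤ j < cols used below is an assumption for illustration, as are the function and node names; the sketch relies on the standard-library graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

def dependence_graph(rows, cols):
    """Map each node (variable, i, j) to the set of nodes it directly depends on.

    Assumes the rectangular index space 0 <= i < rows, 0 <= j < cols.
    """
    deps = {}
    for i in range(rows):
        for j in range(cols):
            deps[("a", i, j)] = {("a", i - 1, j)} if i > 0 else set()  # d_aa = (1 0)^T
            deps[("b", i, j)] = {("b", i, j - 1)} if j > 0 else set()  # d_bb = (0 1)^T
            deps[("c", i, j)] = {("a", i, j), ("b", i, j)}             # d_ac = d_bc = 0
    return deps

# One legal execution ordering: every node appears after all nodes it depends on.
order = list(TopologicalSorter(dependence_graph(2, 2)).static_order())
```

Any ordering produced this way respects the partial order expressed by the DG; the space-time mappings discussed in the paper select orderings with a regular, linear structure.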

Henceforth, and without loss of generality², we assume that all indexed variables
are embedded in a common index space $\mathcal{I}$. Then, the corresponding
dependence graphs can be represented in a reduced form.

Definition 3. (Reduced Dependence Graph). A reduced dependence graph
(RDG) $G = (V, E, D, \mathcal{I})$ of dimension $n$ is a network where $V$ is a set of nodes
and $E \subseteq V \times V$ is a set of edges. To each edge $e = (v_i, v_j)$ there is associated
a dependence vector $d_{ij} \in \mathbb{Z}^n$.

² All described methods can also be applied to each quantification individually.
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Fig. 2. Reduced dependence graph. 



Example 2. In Fig. 2, the RDG of the algorithm introduced in Example 1 is 
shown. 



2.2 Space-Time Mapping 

Linear transformations as in Equation (1) are used as space-time mappings [7,6]
in order to assign a processor index $p \in \mathbb{Z}^{n-1}$ (space) and a sequencing index
$t \in \mathbb{Z}$ (time) to index vectors $I \in \mathcal{I}$.

$\begin{pmatrix} p \\ t \end{pmatrix} = \begin{pmatrix} Q \\ \lambda \end{pmatrix} \cdot I \qquad (1)$

In Eq. (1), $Q \in \mathbb{Z}^{(n-1) \times n}$ and $\lambda \in \mathbb{Z}^{1 \times n}$. The main reason for using linear
allocation and scheduling functions is that the data flow between PEs is local
and regular, which is essential for VLSI implementations. The interpretation of
such a linear transformation is as follows: The set of operations defined at index
points $\lambda \cdot I = \mathrm{const}$ are scheduled at the same time step. The index space of
allocated processing elements (processor space) is denoted by $\mathcal{Q}$ and is given by
the set $\mathcal{Q} = \{p \mid p = Q \cdot I \wedge I \in \mathcal{I}\}$. This set can also be obtained by choosing
a projection of the dependence graph along a vector $u \in \mathbb{Z}^n$, i.e. any coprime*
vector $u$ satisfying $Q \cdot u = 0$ [5] describes the allocation equivalently.

Allocation and scheduling must not violate any data dependencies in the DG.
This is ensured by the well-known causality constraint

$\lambda \cdot d_{ij} \ge 0 \quad \forall (v_i, v_j) \in E. \qquad (2)$

A sufficient condition for guaranteeing that no two or more index points are
assigned to a processing element at the same time step is given by

$\mathrm{rank} \begin{pmatrix} Q \\ \lambda \end{pmatrix} = n. \qquad (3)$

Using the projection vector $u$ satisfying $Q \cdot u = 0$, this condition is equivalent to
$\lambda \cdot u \ne 0$ [10].

* A vector $x$ is said to be coprime if the absolute value of the greatest common
divisor of its elements is one.
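The two conditions above lend themselves to a quick check. The following is a minimal sketch, assuming the dependence vectors of Example 1 and the projection vector $u = (2\ 1)^T$ used later in the paper; the schedule vectors passed in and the helper names `is_causal` and `is_conflict_free` are illustrative assumptions, not part of the original method. Only Eq. (2) and the condition $\lambda \cdot u \ne 0$ are tested, and execution times of operations are ignored.

```python
def dot(a, b):
    """Inner product of two integer vectors given as tuples."""
    return sum(x * y for x, y in zip(a, b))


def is_causal(lam, deps):
    """Eq. (2): lambda * d_ij >= 0 must hold for every dependence vector."""
    return all(dot(lam, d) >= 0 for d in deps)


def is_conflict_free(lam, u):
    """Condition equivalent to Eq. (3): lambda * u != 0."""
    return dot(lam, u) != 0


# Dependence vectors d_aa, d_bb, d_ac, d_bc of Example 1.
deps = [(1, 0), (0, 1), (0, 0), (0, 0)]
u = (2, 1)  # projection vector used in the paper's illustration

# Hypothetical schedule vectors, chosen only to exercise both outcomes.
for lam in [(1, 1), (1, -2)]:
    print(lam, is_causal(lam, deps), is_conflict_free(lam, u))
```

In this sketch $\lambda = (1\ 1)$ passes both checks, while $\lambda = (1\ {-2})$ violates the dependence $d_{bb}$ and is also orthogonal to $u$.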
Fig. 3. In (a), the dependence graph of the algorithm introduced in Example 1 is shown.
Also an allocation given by a projection vector $u$ is illustrated. Counting the number of
processors is equal to counting the number of integral points in a transformed polytope
shown in (b), which may be accomplished using Ehrhart polynomials [2].

Fig. 4. Pareto-points for the matrix multiplication example in the objective space of
latency ($L$) and cost ($C$).



3 Methodology 

Based on the class of piecewise regular algorithms, we want to explore
space-time mappings systematically in order to find optimal implementations. Thereby,
we want to minimize several objectives simultaneously, which is a multiobjective
optimization problem (MOP). In this paper, we consider two objectives: the latency
$L(Q, \lambda)$ as a measure of the performance, and the cost $C(Q, \lambda)$ of a processor array.

As $C$ and $L$ depend on $Q$ and $\lambda$, the search space contains $n \times n$ parameters.
But as already mentioned, a linear allocation can be described equivalently
through a coprime projection vector $u$. Thus, the dimension of the search space
can be reduced to $2 \times n$ (vector $u$, vector $\lambda$).

Fig. 4 shows a typical tradeoff curve between cost and performance for a
matrix multiplication algorithm. Different pairs of latency and cost correspond
to different space-time mappings. As we are concerned with a MOP, there is
not only one optimal solution but typically a set of optimal solutions, so-called
Pareto-optimal solutions. Our MOP consists of two objective functions $C(u, \lambda)$
and $L(u, \lambda)$, where the parameters $u$ and $\lambda$ are denoted as decision variables.
The optimization goal is to simultaneously minimize $C(u, \lambda)$ and $L(u, \lambda)$ within
a search space of feasible space-time mappings.

Definition 4. (Search Space, Decision Vector). Let $x = (u\ \lambda)^T \in \mathbb{Z}^{2n}$ denote a
decision vector and $X$ denote the decision space of all vectors $x$ satisfying Eq. (1),
(2) and (3).

Definition 5. (Pareto-optimality). For any two decision vectors $a, b \in X$, $a$
dominates $b$ iff $(C(a) < C(b) \wedge L(a) \le L(b)) \vee (C(a) \le C(b) \wedge L(a) < L(b))$.
A decision vector $x \in X$ is said to be non-dominated regarding a set $A \subseteq X$
iff $\nexists\, a \in A$ : $a$ dominates $x$. Moreover, $x$ is said to be Pareto-optimal iff $x$ is
non-dominated regarding $X$.

In Fig. 4, the objective space of an example discussed later in Section 4 is shown. 
The white points correspond to Pareto-optimal solutions because they are not 
dominated by any other point. Dominated points are shown in black. 

Now, we are able to formulate our exploration algorithm. For a given RDG
$G = (V, E, D, \mathcal{I})$ and a set $U$ of projection vectors $u$, our exploration methodology
works as follows: First, the cost $C$ for a given projection vector $u$ is determined.
For this allocation, the minimal latency $L$ is computed. Afterwards, we
determine if the design point is non-dominated with respect to the current set of
Pareto-optimal solutions. If it is non-dominated, the decision vector $(u\ \lambda)^T$ is
added to the Pareto-optimal set, denoted $\mathcal{O}$ in the following. Subsequently, the
set $\mathcal{O}$ has to be updated if the new decision vector dominates some other vectors
in $\mathcal{O}$. In the following algorithm, the main ideas of our exploration methodology
are described.

EXPLORE
IN: RDG, set U of projection vector candidates
OUT: Pareto-optimal set O
BEGIN
  FOR each candidate u ∈ U DO
    C ← determineNoOfPEs(u)
    L ← minimize_λ {L(u, λ)}
    IF (u λ)^T is non-dominated with respect to O THEN
      O ← O ∪ {(u λ)^T}
      update(O)
    ENDIF
  ENDFOR
END
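The loop of EXPLORE, together with the dominance test of Definition 5, can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: `cost_of` and `min_latency_of` are hypothetical callbacks standing in for the PE counting and the MILP, and the cost and latency values in the usage part are toy numbers, not results from the paper.

```python
def dominates(a, b):
    """Definition 5 for (cost, latency) pairs: a dominates b."""
    return (a[0] < b[0] and a[1] <= b[1]) or (a[0] <= b[0] and a[1] < b[1])


def explore(candidates, cost_of, min_latency_of):
    """Return the Pareto-optimal design points as {u: (cost, latency, lam)}."""
    pareto = {}
    for u in candidates:
        cost = cost_of(u)                 # stands in for determineNoOfPEs(u)
        latency, lam = min_latency_of(u)  # stands in for the MILP
        point = (cost, latency)
        if any(dominates((c, l), point) for c, l, _ in pareto.values()):
            continue                      # dominated by a known point
        # update(O): drop every stored point that the new point dominates
        pareto = {v: p for v, p in pareto.items()
                  if not dominates(point, (p[0], p[1]))}
        pareto[u] = (cost, latency, lam)
    return pareto


# Toy (cost, latency) table for four hypothetical projection vectors.
toy = {(1, 0): (4, 9), (0, 1): (3, 10), (1, 1): (6, 9), (2, 1): (8, 7)}
result = explore(toy, lambda u: toy[u][0], lambda u: (toy[u][1], None))
print(sorted(result))
```

With these toy values, the candidate $(1\ 1)^T$ is discarded because the point with cost 4 and latency 9 dominates it; the other three remain in the set.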

Next, we briefly describe how the cost $C$ and the latency $L$ may be computed.
Afterwards, we describe how to reduce the set $U$ of candidate vectors $u$ that
must be investigated.

3.1 Cost 

For a regular processor array, we are able to approximate the cost as being
proportional to the processor count.

$C(u, \lambda) = \#PE(u) \cdot (c_{FU} + c_{Rg}(\lambda) + c_{wire}(u)) \qquad (4)$

In Eq. (4), $\#PE(u)$ denotes the number of projected index points when projecting
the index space $\mathcal{I}$ along $u$ (see, e.g., Fig. 3(a)). The cost for functional
units, registers and wiring is denoted by $c_{FU}$, $c_{Rg}$ and $c_{wire}$. In the following,
we assume that processor arrays are resource-dominant: This means that
$c_{FU} \gg c_{Rg}(\lambda) + c_{wire}(u)$. Under these assumptions, we obtain the approximation:

$C(u, \lambda) \approx C(u) = \#PE(u) \cdot c_{FU} \qquad (5)$

As a consequence, the cost of an array is independent of the schedule and
proportional to the number of points in the projected polytope $\mathcal{I}$. This is also the reason
why we are able to investigate only the projection vector candidates $u \in U$ and
minimize the latency $L$.

It remains to determine the number of processor elements for a given linear
allocation. Here, a geometrical approach recently proposed in [1] is applied; for
illustration, see Fig. 3. In (a), the index space of the algorithm described in
Example 1 and a projection vector $u = (2\ 1)^T$ is shown. This linear allocation
leads to an array of 15 processors. This number of processor elements can be
determined by a transformation of the given polytope $\mathcal{I}$. The number of integral
points inside this transformed polytope is equal to the number of processor
elements obtained by the projection along $u$. In [1], it has been shown that this
problem is equal to a counting problem of the number of integral points in a
transformed polytope, see e.g. the polytope shown in Fig. 3(b) for the algorithm
of Example 1. The projection vector $u = (2\ 1)^T$ results in 15 different projected
PEs. This is exactly the number of integral points inside the polytope shown in
Fig. 3(b), see [1] for details. A state-of-the-art solution to the final counting
problem is to use so-called Ehrhart polynomials* [2].
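For small two-dimensional index spaces, the processor count can also be obtained by brute force, which is useful for cross-checking. The sketch below does not use the geometrical approach of [1] or Ehrhart polynomials: it enumerates the index points and counts distinct images under an allocation row vector $q$ with $q \cdot u = 0$. The rectangular index space is a hypothetical stand-in, not the polytope of Example 1, and `processor_count` is an illustrative name.

```python
def processor_count(points, u):
    """Count distinct PEs when projecting 2-D index points along coprime u.

    For a coprime u = (u1, u2), the row vector q = (-u2, u1) satisfies
    q * u = 0, so two index points share a PE exactly when they differ
    by an integer multiple of u.
    """
    q = (-u[1], u[0])
    return len({q[0] * i + q[1] * j for i, j in points})


# Hypothetical 4 x 3 rectangular index space (an assumption for illustration).
points = [(i, j) for i in range(4) for j in range(3)]

for u in [(1, 0), (0, 1), (1, 1), (2, 1)]:
    print(u, processor_count(points, u))
```

On this rectangle, projecting along the axes gives 3 and 4 processors, while the skewed vectors $(1\ 1)^T$ and $(2\ 1)^T$ give 6 and 8.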

3.2 Latency 

In this section, a short description is given of how the latency for a given piecewise
regular algorithm and a given schedule vector $\lambda$ is determined. For approximation
of the latency, the following term is used:

$L = \max_{I \in \mathcal{I}} \{\lambda \cdot I\} - \min_{I \in \mathcal{I}} \{\lambda \cdot I\} = \max_{I_1, I_2 \in \mathcal{I}} \{\lambda \cdot (I_2 - I_1)\}.$
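The latency term can be evaluated directly on an enumerated index space. The following sketch assumes a hypothetical 4 x 3 rectangular index space and arbitrary schedule vectors; it computes only the approximation above and does not model execution times or resource constraints, which the MILP described next accounts for.

```python
def latency(points, lam):
    """Approximate latency: spread of the time steps lam * I over all points."""
    times = [sum(l * x for l, x in zip(lam, p)) for p in points]
    return max(times) - min(times)


# Hypothetical 4 x 3 rectangular index space (an assumption for illustration).
points = [(i, j) for i in range(4) for j in range(3)]

print(latency(points, (1, 1)))  # time steps range from 0 to 5
print(latency(points, (1, 2)))  # time steps range from 0 to 7
```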

The latency minimization problem in algorithm EXPLORE may be formulated as
a mixed integer linear program (MILP) [14,13]. This well-known method is used
here during exploration as a subroutine. In this MILP, the number of resources
inside each processing element can be limited (determining $c_{FU}$). An operation
may also be mapped onto different resource types (module selection), and
pipelining is possible. As a result of the MILP, we obtain:

— the minimal latency $L$,

— the corresponding optimal schedule vector $\lambda$,

— the iteration interval** $P$,

— the start times of each $v_i \in V$ within the iteration interval,

— the selected resource type for each $v_i \in V$.

* Due to space limits, we omit the details of this procedure.

** The iteration interval $P$ of an allocated and scheduled piecewise regular algorithm
is the number of time instances between the evaluation of two successive instances
of a variable within one processing element [14].

Here, only the latency is used for rating the performance. The other values,
however, are necessary for simulation and synthesis. We will present a detailed
example of this procedure in Section 4.

In the following, we introduce two new methods for reducing the search space
for Pareto-optimal space-time mappings.

3.3 Projection Vector Candidates 

Let $\mathcal{I} \subset \mathbb{Z}^n$ be a linearly bounded lattice according to Definition 2. In the
following, we investigate projection vectors for the polytope
$\mathcal{P} = \{\kappa \in \mathbb{Z}^n \mid A\kappa \ge b\}$. By our assumption that the lattice matrix $M$ has full
rank, projection vectors $u' \in \mathbb{Z}^n$ for $\mathcal{P}$ may be transformed to a corresponding
projection vector $u \in \mathbb{Z}^n$ in $\mathcal{I}$ by $u = Mu'$.

For the exploration, it is necessary to determine a set $U$ of projection vector
candidates. This search space may be bounded as follows: Note that a projection
vector cannot be optimal unless at least two points $\kappa_1, \kappa_2 \in \mathcal{P}$ are projected
onto each other:

$\kappa_1 - \kappa_2 = \alpha u', \quad \alpha \in \mathbb{Z}. \qquad (6)$

Hence, the search space may be bounded by the set of possible differences of two
points in $\mathcal{P}$, the so-called difference body $\mathcal{D}$ of $\mathcal{P}$ [15], which again is a polytope.

$\mathcal{D} = \{\kappa \in \mathbb{Z}^n \mid \kappa = \kappa_1 - \kappa_2 \wedge \kappa_1, \kappa_2 \in \mathcal{P}\}.$

$\mathcal{D}$ is convex and symmetric about the origin (see, e.g., Fig. 5
for the polytope $\mathcal{P}$ in Fig. 3(a)). From duality, $\mathcal{D} = \{\kappa \in \mathbb{Z}^n \mid A^d \kappa \ge b^d\}$ is
the intersection of closed half-spaces. Furthermore, let $B \subset \mathbb{Z}^n$ be the smallest
n-dimensional box (bounding box) containing $\mathcal{D}$.

In the following, a procedure for the reduction of suitable projection vector
candidates is described:

— Compute all vertices $V$ of the polytope $\mathcal{P}$.

— For each pair $v_i, v_j \in V$ compute the vertex difference $v_i - v_j$. The set of
vertex differences is denoted by $V^d$.

— Determine the dual representation of $V^d$. This is the convex polytope $\mathcal{D}$.
Also determine the bounding box $B$ of $\mathcal{D}$.

— Iterate over all points $u' \in B$. Since $\mathcal{D}$ is symmetric about the
origin, $B$ is also symmetric about the origin. Due to symmetry, it is only
necessary to consider, e.g., all positive values for the first component of $u'$.
Furthermore, the selected projection vectors $u'$ have to be coprime. Finally,
test if $u'$ is in $\mathcal{D}$. If $u' \in \mathcal{D}$, the condition in Eq. (6) that at least two
points are mapped onto each other is satisfied.

Fig. 5. Difference body of the convex polytope from Fig. 3(a).

Example 3. Reconsider the polytope shown in Fig. 3(a) with the vertices



All differences $v_i - v_j$, $i, j \in [1,4]$, $i \ne j$, are marked in Fig. 5 as small white
boxes. $\mathcal{D}$ is bounded by the black line, $B$ by the dashed line. Due to symmetry,
only the upper half-space has to be explored. All coprime integral points $(i\ j)^T$,
$i \in [-9,9]$, $j \in [0,7]$, which lie inside $\mathcal{D}$ are projection vector candidates.
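The candidate reduction can be sketched for the two-dimensional case as follows. This is an illustration under stated assumptions: the polytope is a hypothetical rectangle rather than the one of Example 3, the difference body is obtained as the convex hull of the vertex differences (Andrew's monotone chain), and symmetry is exploited by keeping only the upper half-plane plus the positive x-axis. The function names are illustrative, and a full-dimensional polytope is assumed.

```python
from itertools import product
from math import gcd


def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside(hull, p):
    """True if p lies inside or on the boundary of the convex polygon."""
    n = len(hull)
    return all(cross(hull[k], hull[(k + 1) % n], p) >= 0 for k in range(n))


def projection_candidates(vertices):
    """Coprime integral vectors inside the difference body, up to sign."""
    diffs = [(a[0] - b[0], a[1] - b[1]) for a, b in product(vertices, repeat=2)]
    hull = convex_hull(diffs)
    xmax = max(abs(d[0]) for d in diffs)  # bounding box half-widths
    ymax = max(abs(d[1]) for d in diffs)
    cands = []
    for y in range(0, ymax + 1):
        for x in range(-xmax, xmax + 1):
            if y == 0 and x <= 0:
                continue  # zero vector and mirror images on the x-axis
            if gcd(abs(x), abs(y)) == 1 and inside(hull, (x, y)):
                cands.append((x, y))
    return cands


# Hypothetical rectangle with corners (0,0) and (3,2).
cands = projection_candidates([(0, 0), (3, 0), (3, 2), (0, 2)])
print(len(cands), cands)
```

For this rectangle the difference body is the box $[-3,3] \times [-2,2]$, and 12 candidates remain after the coprimality and symmetry filters.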



3.4 Further Reduction of the Search Space 

The order in our exploration algorithm, determining the cost first, has the
advantage that the search space can possibly be reduced further by adding a more
restrictive constraint to the MILP for latency minimization: Let $\mathcal{O}$ be the set of
Pareto-points determined so far (see Fig. 6). The dashed line denotes the
computed cost of a design point $(u_j\ \lambda_j)^T$. If this design point is to be Pareto-optimal,
$L(\lambda_j)$ must obviously be smaller than or equal to the latency $L(\lambda_i)$ of all points
$o_i \in \mathcal{O}$ for which the cost $C(u_i)$ is smaller than or equal to $C(u_j)$:

IF (∃ o_i = (u_i λ_i)^T ∈ O | C(u_i) ≤ C(u_j)) THEN
  let o_i ∈ O be the Pareto-point for which
    max_{o_i ∈ O} {C(u_i) | C(u_i) ≤ C(u_j)} holds
  IF (C(u_i) < C(u_j)) THEN
    add constraint L(λ_j) < L(λ_i) to MILP
  ELSE
    IF (C(u_i) = C(u_j)) THEN
      add constraint L(λ_j) ≤ L(λ_i) to MILP
    ENDIF
  ENDIF
ENDIF

Fig. 6. Pareto-points obtained through design space exploration for the algorithm
introduced in Example 1.
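The selection of the additional MILP constraint can be sketched as a small helper. The function name `latency_constraint` and the list-of-pairs representation of the Pareto set are assumptions for illustration, and the numbers in the usage part are toy values. The helper only determines the bound and whether it is strict; adding it to an actual MILP is left out.

```python
def latency_constraint(pareto, cost_j):
    """Return (bound, strict) for a new design point with cost cost_j.

    pareto is a list of (cost, latency) pairs of the Pareto-points found
    so far. Among all points with cost <= cost_j, the one with the largest
    cost is chosen. The new latency must be < bound if that point is
    strictly cheaper, and <= bound if the costs are equal. Returns None
    when no point is at most as expensive as cost_j.
    """
    cheaper = [p for p in pareto if p[0] <= cost_j]
    if not cheaper:
        return None
    cost_i, latency_i = max(cheaper, key=lambda p: p[0])
    return (latency_i, cost_i < cost_j)


# Toy Pareto set of (cost, latency) pairs.
pareto = [(3, 10), (4, 9), (8, 7)]
print(latency_constraint(pareto, 6))  # bound 9, strict
print(latency_constraint(pareto, 4))  # bound 9, not strict
print(latency_constraint(pareto, 2))  # no constraint applies
```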



4 Results 

First, space-time mappings for the algorithm introduced in Example 1 are
explored. The bounding box (Fig. 5) contains 295 integral points as candidates for
projection vectors. When symmetry is exploited and only coprime vectors are
considered, $U$ is reduced to 45 candidates. For each of these projection vectors,
the cost $C$ is determined. Subsequently, the latency is minimized. The results
are visualized in Fig. 6; the Pareto-optimal solutions are the white points and are
presented in Table 1. The MILP was solved for execution times of 1 unit for $f(a)$
and $g(b)$. For op, we considered 4 time units. From the solution of the MILP,
we obtain the schedule vector $\lambda$, the iteration interval $P$ and all starting
times for each operation within the iteration interval. In the following, we take
a closer look at the solution for $u = (2\ 1)^T$. The corresponding iteration interval
is 4 and the starting points are $\tau(a) = 0$, $\tau(b) = 0$, and $\tau(c) = 1$. In Fig. 7, the
scheduling for the processors $p = 3$, 4, and 5 is shown. The data dependencies
between adjacent index points are visualized by arcs.

The second example is a matrix multiplication algorithm. The product $C = A \cdot B$
of two matrices $A$ of size $N_1 \times N_3$ and $B$ of size $N_3 \times N_2$ is defined as follows:

$c_{ij} = \sum_{k=1}^{N_3} a_{ik} b_{kj}, \quad 1 \le i \le N_1,\ 1 \le j \le N_2.$
Design Space Exploration for Massively Parallel Processor Arrays 







Fig. 7. Bar chart of scheduled algorithm. 

Table 1. Pareto-points of the design space exploration for the algorithm in Example 1.

A corresponding piecewise regular algorithm is given by

input operations
  a[i,0,k] ← a_ik
  b[0,j,k] ← b_kj
  c[i,j,0] ← 0
computations
  a[i,j,k] ← a[i,j-1,k]
  b[i,j,k] ← b[i-1,j,k]
  z[i,j,k] ← a[i,j,k] · b[i,j,k]
  c[i,j,k] ← c[i,j,k-1] + z[i,j,k]
output operations
  c_ij ← c[i,j,N_3]

where the index space is

$\mathcal{I} = \{I = (i\ j\ k)^T \in \mathbb{Z}^3 \mid 1 \le i \le N_1 \wedge 1 \le j \le N_2 \wedge 1 \le k \le N_3\}.$
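The piecewise regular formulation can be executed sequentially to confirm that it computes the matrix product. The sketch below stores the indexed variables in dictionaries keyed by $(i, j, k)$ and visits the index space in lexicographic order, which respects all three dependencies; the function name `matmul_regular` and the test matrices are illustrative assumptions.

```python
def matmul_regular(A, B):
    """Evaluate the piecewise regular matrix multiplication sequentially."""
    n1, n3, n2 = len(A), len(B), len(B[0])
    # Input operations: values enter at the borders j = 0, i = 0 and k = 0.
    a = {(i, 0, k): A[i - 1][k - 1]
         for i in range(1, n1 + 1) for k in range(1, n3 + 1)}
    b = {(0, j, k): B[k - 1][j - 1]
         for j in range(1, n2 + 1) for k in range(1, n3 + 1)}
    c = {(i, j, 0): 0 for i in range(1, n1 + 1) for j in range(1, n2 + 1)}
    # Computations over the index space 1 <= i <= N1, 1 <= j <= N2, 1 <= k <= N3.
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            for k in range(1, n3 + 1):
                a[i, j, k] = a[i, j - 1, k]      # propagate a along j
                b[i, j, k] = b[i - 1, j, k]      # propagate b along i
                z = a[i, j, k] * b[i, j, k]
                c[i, j, k] = c[i, j, k - 1] + z  # accumulate along k
    # Output operations: c_ij is read at k = N3.
    return [[c[i, j, n3] for j in range(1, n2 + 1)] for i in range(1, n1 + 1)]


print(matmul_regular([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```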

The input operations a and b are each mapped to one resource of type input. The
execution times of these operations are zero. This is equivalent to a multi-cast
without delay to a set of index points. For the multiplication (variable z), an
execution time of 4 time units is considered, whereby the multiplier is pipelined,
being able to start a new execution every two time units. The addition (variable
c) takes three time units and, by use of pipelining, is able to start a new
operation every time unit.

Table 2. Pareto-points of the design space exploration for the matrix multiplication
algorithm.

An exploration for $N_1 = 4$, $N_2 = 5$ and $N_3 = 2$ has been performed. The
search space of $|[-4,4]| \cdot |[-5,5]| \cdot |[-2,2]| = 9 \cdot 11 \cdot 5 = 495$ projection vector
candidates can be reduced to 83 using our reduction techniques. The results are
visualized in Fig. 4. We obtain three Pareto-optimal solutions shown in Table 2.



5 Conclusion and Future Work 

We have presented a first approach for systematically exploring Pareto-optimal
space-time mappings for a class of algorithms with uniform data dependencies.
The considered objective functions are cost and performance (latency). In our
exploration algorithm we also introduced several new techniques for reducing
the search space for Pareto-optimal space-time mappings.

Our exploration framework is part of the PARO design system, which also
supports the automated synthesis of regular circuits.

In the future, we would like to extend the presented results to include energy
consumption as an additional objective and to perform symbolic design space
exploration for parameterized index spaces.
Abstract. A model called global cellular automata (GCA) will be introduced.
The new model preserves the good features of the cellular
automata but overcomes its restrictions. In the GCA the cell state consists
of a data field and additional pointers. Via these pointers, each cell
has read access to any other cell in the cell field, and the pointers may
be changed from generation to generation. Compared to the cellular automata,
the neighbourhood is dynamic and differs from cell to cell. For
many applications parallel algorithms can be found straightforwardly and
can directly be mapped onto this model. As the model is also massively
parallel in a simple way, it can efficiently be supported by hardware.



1 Motivation 

The classical cellular automata model (CA) can be characterised by the following
features:

— The CA consists of an n-dimensional field of cells. Each cell can be identified
by its coordinates.

— The neighbours are fixed and are defined by relative coordinates. 

— Each cell has local read access to the states of its neighbours. Each cell 
contains a local rule. The local rule defines the next state depending on the 
cell state and the states of the neighbours. 

— The cells are updated synchronously; the new generation of cells (new cell
states) depends on the old generation (old cell states).

— The model is massively parallel, because all next states can be computed and
updated in parallel.

— Space or time dependent rules can be implemented by the use of special 
space or time information coded in the state. 

The CA is very well suited to problems and algorithms which only need
access to their fixed local neighbours [7]. Algorithms with global (long distance)
communication can only indirectly be implemented by CA. In this case the
information must be transported step by step along the line from the source cell
to the destination cell, which needs a lot of time. Therefore the CA is not an
efficient model for global algorithms.

We have searched for a new model which preserves the good features of the
CA but overcomes the local communication restriction. The new model shall
still be massively parallel, but at the same time suited to any kind of global algorithm.
Thus we will be able to describe more complex algorithms in a more efficient
and direct way. We have also investigated how this model can efficiently be
implemented in hardware.

2 The GCA Model 

The model is called global cellular automata model (GCA). The GCA can be
characterised by the following features:

— A GCA consists of an n-dimensional field of cells. Each cell can be identified
by its coordinates.

— Each cell has n individual neighbours which are variable and may change
from generation to generation. The neighbours are defined by relative
coordinates (addresses, pointers).

— The state of a cell contains a data field and n address fields.

State = (Data, Address1, Address2, ...)

— Each cell has global read access to the states of its neighbours by the use of 
the address fields. 

— Each cell contains a local rule. The local rule defines the next state depending 
on the cell state and the states of the neighbours. By changing the state, the 
addresses may also be changed, meaning that in the next generation different 
neighbours will be accessed. 

— The cells are updated synchronously, the new generation of cells depends on 
the old generation. 

— The model is massively parallel, because all next states can be computed and
updated in parallel.

— Space or time dependent rules can be implemented by the use of special 
space or time information coded in the state. 

A one-dimensional GCA with two address fields will be defined in a formal
way, using a PASCAL-like notation:

1. The cell field

   Cell = array [0..n-1] of State

2. The State of each cell

   State = record
     Data: Datatype
     Address1 : 0..n-1
     Address2 : 0..n-1
   endrecord

3. The definition of the local rule

   function Rule (Self: State,
                  Neighbour1: State, Neighbour2: State) : State

4. The computation of the next generation

   for i:=0..n-1 do in parallel
     Cell[i] := Rule (Cell[i], Cell[Address1], Cell[Address2])
   endfor

Fig. 1. The GCA model.

Fig. 1 shows the principle of the GCA model. Cell[i] reads two other cell 
states and computes its next state, using its own state and the states of the two 
other cells in access. In the next state, Cell[i] may point to two different cells. 
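One generation of a two-handed GCA can be simulated sequentially as sketched below. The essential point is the synchronous update: every new state is computed from the old generation only. The `State` class mirrors the record above, while the rule `max_rule`, the ring size `N` and the initial pointers are hypothetical choices made for illustration.

```python
from dataclasses import dataclass

N = 4  # number of cells in this small illustration


@dataclass(frozen=True)
class State:
    data: int
    address1: int
    address2: int


def next_generation(cells, rule):
    """Synchronous update: all reads refer to the old generation."""
    return [rule(s, cells[s.address1], cells[s.address2]) for s in cells]


def max_rule(own, nb1, nb2):
    """Hypothetical rule: keep the maximum and advance both pointers by one."""
    return State(max(own.data, nb1.data, nb2.data),
                 (own.address1 + 1) % N,
                 (own.address2 + 1) % N)


# Cell i initially points to cells i+1 and i+2 (modulo N).
cells = [State(d, (i + 1) % N, (i + 2) % N) for i, d in enumerate([3, 1, 4, 1])]
cells = next_generation(cells, max_rule)
print([s.data for s in cells])
```

Because the list comprehension builds the new generation before `cells` is rebound, no cell ever observes a partially updated field, which corresponds to the absence of write conflicts in the model.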

The above model can be defined in a more general way with respect to the 
following features 

— The number k of addresses can be 1, 2, 3, ... If k=1 we call it a one-handed
GCA, if k=2 we call it a two-handed GCA and so forth.

— The number k may vary in time and from cell to cell, in this case it will be 
a variable-handed GCA. 

— Names could be used for the identification of the cells, instead of ordered 
addresses. In this case the cells can be considered as an unordered set of 
cells. 

— A special passive state may be used to indicate that the cell state shall not 
be changed any more. It can be used to indicate the end of the computation 
or the deletion of a cell. A cell which is not in the passive state is called 
active. An active cell may turn a passive cell to active. 

Similar models (pointer machines) have been proposed before [4,5]. In these
models nodes are accessed step by step via fixed pointers stored in a tree-like
structure. In our model any node can immediately be accessed because the whole
pointer-structure can be changed from generation to generation through address
calculations. The PSA-model [1] is a model which allows parallel state
substitutions on arbitrary cells in the field. In the PSA-model each cell (or all complex)
tries to perform the same set of substitution rules, on the same set of
neighbours. In our model each cell has access to individual neighbours and each cell
may compute a new neighbourhood from generation to generation.

GCA: Global Cellular Automata. A Flexible Parallel Model

3 Mapping Problems on the GCA Model 

The GCA has a very simple and direct programming model. The programming
model is the way the programmer has to think in order to map an algorithm
to a certain model, which is interpreted by a machine. In our case, the
programmer has to keep in mind that a machine exists which interprets and
executes the GCA model.

Many problems can easily and efficiently be mapped to the GCA model, e.g. 

— sorting of numbers 

— reducing a vector, like sum of vector elements 

— matrix multiplication 

— permutation of vector elements 

— graph algorithms 

The following examples are written in the cellular programming language
CDL [2]. CDL was designed to facilitate the description of cellular rules based
on a rectangular n-dimensional grid with a local neighbourhood. The locality
of the neighbourhood radius was asserted and controlled by the declaration of
distance=radius. For the GCA the new keyword infinity was introduced for
the declaration of the radius.

In CDL the unary operator * is used (like in C) to dereference the relative 
address of a cell in order to obtain the state of the referenced cell. The following 
examples are one-handed GCAs, showing how useful unlimited read-access to 
any other cell is. 



3.1 Example 1: Fast Fourier Transformation 

The Fast Fourier transformation (FFT) is our first example. The FFT is used
to transform a time-discrete signal into its frequency components. We do not
explain the algorithm in detail, because it can be found in many books, e.g. [6].
The example is used to demonstrate that a complex algorithm can

— easily be mapped onto the GCA model 

— concisely be described 

— efficiently be executed in parallel 

Each cell contains a complex number (r,i) which is calculated in every time
step from its own number and the number contained in another cell. The address
of the other cell depends on its own absolute address (position) and the time
step in the way shown in Fig. 2.

Fig. 2. The FFT access pattern.

For example, the cell at position 2 reads the cell at position 3 in the first
step, the cell at position 0 in the next step, and the cell at position 6 in the last
time step. Obviously this access pattern cannot be implemented efficiently on
a classical cellular automaton using strict locality.
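The access pattern can be reproduced with a single exclusive-or per time step, which is also how the partner address is formed in the rule of the listing. The following sketch computes absolute partner positions only; the function name `fft_partners` is an illustrative assumption and no FFT arithmetic is performed.

```python
def fft_partners(position, k):
    """Absolute positions read by a cell in each of the k time steps.

    The step value doubles every generation (1, 2, 4, ...), and the
    partner is obtained as position xor step.
    """
    return [position ^ (1 << t) for t in range(k)]


# Eight cells (k = 3): partners of the cell at position 2.
print(fft_partners(2, 3))
```

For position 2 this yields the positions 3, 0 and 6, matching the example in the text.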

(1)  cellular automaton fft;
(2)
(3)  const dimension = 1;
(4)        distance  = infinity;
(5)
(6)  const pi = 3.141592654;
(7)
(8)  type celltype=record
(9)         r,i      : float;    /* the complex value */
(10)        step     : integer;  /* initialised with 1 */
(11)        position : integer;  /* init with 0..(2^k)-1, n ... */
(12)      end;
(13)
(14) var other : celladdress;
(15)     a,wr,wi : float;
(16)
(17) #define cell *[0]   /* *[0] means "state of center cell" */
(18)
(19) rule begin
(20)   /* calculate relative address of other cell */
(21)   other := [(cell.position xor cell.step) - cell.position];
(22)
(23)   /* calculate new values for local r and i */
(24)   a := (-pi * (cell.position + (cell.step-1))) / cell.step;
(25)
(26)   wr := cos(a);
(27)   wi := sin(a);
(28)
(29)   if (other.x > 0) then
(30)   begin
(31)     /* other cell has higher number */
(32)
(33)     cell.r := cell.r + wr * (*other.r) - wi * (*other.i);
(34)     cell.i := cell.i + wr * (*other.i) + wi * (*other.r);
(35)   end
(36)   else
(37)   begin
(38)     /* other cell has lower number */
(39)
(40)     cell.r := *other.r - (wr * cell.r - wi * cell.i);
(41)     cell.i := *other.i - (wr * cell.i + wi * cell.r);
(42)   end;
(43)
(44)   /* step = 1,2,4,8,... */
(45)   cell.step := 2 * cell.step;
(46)
(47) end;

The algorithm is concise and efficient because the address of the neighbour
is calculated (line (21)) and thereby an individual neighbour is accessed (lines
(33) and (34)). The listing of the FFT without using this feature would be at
least twice as long and the calculation would take significantly more time. The
time complexity is $O(\mathrm{ld}(n))$, with n = number of cells/positions ($n = 2^k$).

3.2 Example 2: Bitonic Merge

The bitonic merge algorithm sorts a bitonic sequence. A sequence of numbers is
called bitonic if the first part of the sequence is ascending and the second part is
descending, or if the sequence is cyclically shifted. Consider a sequence of length
$n = 2^k$. In the first step cells with distance $2^{k-1}$ are compared (Fig. 3).

Fig. 3. Bitonic sequence access pattern.

Their data values are exchanged if necessary to get the minimum to the left
and the maximum to the right. In each of the following steps the distance for
the cells to be compared is half of the distance of the preceding step. Also, with
each step the number of subsequences is doubled. There is no communication
between different subsequences. The number of parallel steps is $k = \mathrm{ld}(n)$.



cellular automaton bitonic_merge;
const dimension = 1;
      distance  = infinity;

type celltype=record
       /* data is initialized by a bitonic sequence */
       data : integer;
       /* own_pos is initialized by 0..(2^k)-1 */
       own_pos : integer;
       /* other_pos initialized by (2^k)/2 */
       other_pos : integer;
     end;

var other : celladdress;
    w,a : integer;

#define cell *[0]

rule begin
  if ((cell.own_pos and cell.other_pos) = 0) then
  begin
    /* relative address of cells with higher numbers */
    other := [cell.other_pos];
    w := *other.data;
    a := cell.data;
    /* comparator */
    if (w < a) then cell.data := w;
  end
  else
  begin
    /* relative address of cells with lower numbers */
    other := [-cell.other_pos];
    w := *other.data;
    a := cell.data;
    /* comparator */
    if (a < w) then cell.data := w;
  end;
  /* access-pattern is (2^k)/2, ..., 4,2,1 */
  cell.other_pos := cell.other_pos / 2;
end;
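The rule of the listing can be simulated sequentially to observe the generations. The sketch below is a Python transcription under the assumption that the input length is a power of two and the input is bitonic; each pass of the outer loop corresponds to one synchronous generation, so all reads use the old data and all writes go to a fresh list.

```python
def bitonic_merge(data):
    """Sort a bitonic sequence of length 2^k in k synchronous generations."""
    n = len(data)
    other_pos = n // 2
    while other_pos >= 1:
        new = list(data)  # next generation, written without touching the old one
        for pos in range(n):
            if pos & other_pos == 0:
                # partner has the higher position: keep the minimum here
                w = data[pos + other_pos]
                if w < data[pos]:
                    new[pos] = w
            else:
                # partner has the lower position: keep the maximum here
                w = data[pos - other_pos]
                if data[pos] < w:
                    new[pos] = w
        data = new
        other_pos //= 2  # access pattern n/2, ..., 4, 2, 1
    return data


print(bitonic_merge([1, 4, 6, 8, 7, 5, 3, 2]))
```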







4 Conclusion 

We have introduced a powerful model, called global cellular automata (GCA).
The cell state is composed of a data field and n pointers which point to n
arbitrary other cells. The new cell state is computed by a local rule, which takes
into account its own state and the states of the other cells which are in access
via the pointers. In the next generation the pointers may point to different cells.
Each cell changes its state independently from the other cells; there are no write
conflicts. Therefore the GCA model is massively parallel, meaning that it has a
great potential to be efficiently supported by hardware. We plan to implement
the GCA model on the CEPRA-S processor [3].

Parallel algorithms can easily be described and mapped onto the GCA. Compared
to the CA model it is much more flexible although it is only a little more
complex.
Abstract. This paper is oriented to algorithm architecture for computation
of polynomials. The algorithm is based on the “divide and conquer”
method and is expressed in terms of a model of fine-grained parallelism,
the Parallel Substitution Algorithm.



1 Introduction 

It is known that the efficiency of computing most numerical functions
strongly depends on the efficiency of polynomial computation. Polynomial
approximation is thus a standard and frequently used operation. As a result,
more attention is paid to designing high-speed algorithms for
polynomial computation which are also suitable for VLSI implementation.

In this paper we describe a cellular-pipelined architecture of the algorithm for
polynomial computation. The interest in cellular algorithms is associated with
their properties: homogeneity, maximal parallelism and high-tech mapping into
VLSI.

Parallel Substitution Algorithm (PSA) [1,2] is used for the above algorithm
design and modeling. Unlike other cellular models, PSA properties and expressive
capabilities allow any complex algorithm to be represented. Moreover, there is
a one-to-one correspondence between PSA and automata net, which forms the basis
for the architectural design. Traditionally, the Horner scheme is used for
polynomial computation. It requires $O(n)$ steps, where $n$ is the degree of the polynomial.
Systolic algorithms for polynomial computation are widely covered in the
literature. To reduce the time complexity of the algorithm to $O(\log n)$ steps, we employ
the “divide and conquer” method [3]. (FFT is the best illustration of the use of this
method and of the properties of complex roots of unity [3].) The degree to which the
parallelism of the method is exploited is determined by the size of the cellular array.

The presented algorithm computes polynomials in an array of restricted size
in time $(l + 13)\lceil \log n \rceil + \lceil \log l \rceil - 2$, where $l$ is the length of the polynomial
coefficients and the variable, and $\lceil x \rceil$ denotes the ceiling of $x$.

The article is organized as follows. In the second section the “divide and conquer” 
method for computing polynomials is given. The cellular-pipelined algorithm 
architecture for computing polynomials and its time complexity are discussed in 
the third section. 

Fig. 1. The computational tree 

2 Method for Computing Polynomials 

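Let $P(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_0$, $a_n \neq 0$, be a polynomial of 
degree $n$, and let $n = 2^p - 1$. Then 

$$P(x) = a_{2^p-1} x^{2^p-1} + a_{2^p-2} x^{2^p-2} + \cdots + a_0. \qquad (1)$$

Using the “divide and conquer” method, the polynomial (1) is represented as 

$$P(x) = \sum_{m=0}^{2^{p-k}-1} A_m^k(x)\, x^{m 2^k},$$

where $A_m^0(x) = a_m$ and $A_m^k(x)$ is the $m$-th partial polynomial of degree at 
most $2^k - 1$. Since 

$$\sum_{m=0}^{2^{p-k}-1} A_m^k(x)\, x^{m 2^k} = \sum_{m=0}^{2^{p-k-1}-1} \left( A_{2m+1}^k(x)\, x^{2^k} + A_{2m}^k(x) \right) x^{m 2^{k+1}},$$

we have the following recurrence 

$$A_m^{k+1}(x) = A_{2m+1}^k(x)\, x^{2^k} + A_{2m}^k(x), \quad 0 \le k \le p-1, \quad 0 \le m \le 2^{p-k-1} - 1, \qquad (2)$$

where $k$ is the index of recursion. Let us call $A_{2m+1}^k(x)$ and $A_{2m}^k(x)$ in (2) the 
first and the second coefficient of the $m$-th partial polynomial, respectively. 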

So, the computation of $P(x)$ of degree $n$ is reduced to the recursive computation of 
the partial polynomials (2) and their composition. The result is formed in time 
$O(\log n)$. Fig. 1 gives an example of the computation of a polynomial of degree 7. 
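
The recurrence can be illustrated with a short sequential sketch. It is not the cellular algorithm itself, only a plain Python rendering of the "divide and conquer" scheme under the assumption that the number of coefficients is a power of two; the function names `horner` and `divide_and_conquer` are hypothetical.

```python
def horner(coeffs, x):
    """Reference Horner scheme, O(n) sequential steps; coeffs[i] is a_i."""
    result = 0
    for a in reversed(coeffs):
        result = result * x + a
    return result


def divide_and_conquer(coeffs, x):
    """Evaluate P(x) level by level using the pairwise recurrence.

    Assumes len(coeffs) == 2**p. Each pass of the loop builds one level of
    the computational tree; the entries of a level are independent of each
    other, so a parallel machine could compute them simultaneously.
    """
    count = len(coeffs)
    if count == 0 or count & (count - 1):
        raise ValueError("number of coefficients must be a power of two")
    level = list(coeffs)      # level 0: the coefficients themselves
    power = x                 # x^(2^k), starting with k = 0
    while len(level) > 1:
        # combine the odd-indexed and even-indexed neighbours
        level = [level[2 * m + 1] * power + level[2 * m]
                 for m in range(len(level) // 2)]
        power = power * power  # square needed by the next level
    return level[0]
```

For a polynomial of degree 7 the loop runs three times, which matches the three levels of the tree in Fig. 1.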
3 Cellular-Pipelined Algorithm Architecture 

In this section we first discuss the computational scheme of the presented 
algorithm, then take a quick look at the cellular algorithm architecture, and 
finally estimate the time complexity of the algorithm. 

3.1 Calculation Scheme of Algorithm 

Let $P(x)$ be a polynomial of degree $n$. For simplicity, we make the following 
assumptions: $0 < a_i < 1$, $i = 0, 1, \ldots, n$, $0 < x < 1$, $\sum_{i=0}^{n} a_i < 1$, the initial and 
intermediate data are $l$-bit binary numbers, and $l > n$. It is required to calculate 
$P(x)$ in an array of size $(l \times l)$ used for multiplication. 

The cellular-pipelined algorithm for computing a sum of products [2] is used 
for forming the partial polynomials and the squares. The algorithm computes 
the products in a redundant form in an array of size $(l \times l)$ with a period of 
4 steps. A high degree of pipelining is achieved as follows. The multipliers 
are loaded digit-serially, the least significant bit first. The multiplicands are 
loaded digit-parallel at 4-step intervals. The fast carry-save technique is used 
for summation. The first product is obtained at the $(l + 1)$-th step, the second 
product 4 steps later, and so on. 
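
The redundant (two-row) form produced by carry-save summation can be sketched as follows. This is not the cellular algorithm of [2]; it is a minimal illustration, assuming unsigned integer operands, of how a product can be accumulated as a pair (carry, sum) while the multiplier is consumed least significant bit first. The names `csa` and `multiply_carry_save` are hypothetical.

```python
def csa(a, b, c):
    """3:2 carry-save compressor: returns (carry, total) with a + b + c == carry + total."""
    total = a ^ b ^ c                            # bitwise sum, carries ignored
    carry = ((a & b) | (a & c) | (b & c)) << 1   # carries, moved to their weight
    return carry, total


def multiply_carry_save(multiplicand, multiplier, l):
    """Accumulate multiplicand * multiplier as a two-row code (c, s).

    The multiplier is read one bit per iteration, least significant bit
    first; no carry is ever propagated across the whole word.
    Assumes the multiplier fits in l bits.
    """
    c, s = 0, 0
    for i in range(l):
        bit = (multiplier >> i) & 1
        partial = (multiplicand << i) if bit else 0
        c, s = csa(c, s, partial)
    return c, s
```

The pair returned by `multiply_carry_save` represents the product only through the sum of its two rows, which is why a separate step is needed whenever a nonredundant value is required.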

The computational scheme of the algorithm is shown in Fig. 2. It is obtained 
from the scheme of Fig. 1 by pipelining the computation over two parameters: over 
$k$, $k = 0, 1, \ldots, p-1$, and over $m$, $m = 2^{p-k-1} - 1, \ldots, 0$. For a given $k$, the square 
$x^{2^k} \cdot x^{2^k} = x^{2^{k+1}}$ is computed first, and then the products 
$A_{2m+1}^k(x)\, x^{2^k}$ are computed, beginning from the maximum value of $m$. Let us call 
the partial polynomial $A_m^k(x)$ for the maximum value of $m$ a leading polynomial of 
degree less than or equal to $2^k$. Using the algorithm of [2], the leading polynomial of degree 
1 ($A_3^1$) is ready at the $(l + 8)$-th step (the summation requires 3 steps). Hence, 
beginning from the 9-th step the algorithm can compute the leading polynomial 
of degree 2. 

Fig. 3. Architecture for computing a polynomial 

All intermediate results are in the redundant form. Only the square is 
transformed into a nonredundant form. This transformation does not require 
additional time. In this case, computing the products $A_{2m+1}^k(x)\, x^{2^k}$ and the square 
$x^{2^{k+1}}$, $k \ge 1$, is reduced to two multiplications and one summation. 

3.2 Cellular-Pipelined Algorithm Architecture 

The algorithm architecture corresponds to its computational scheme and is given in 
Fig. 3. The data to be processed are allocated in 10 arrays. 

The initial data are placed as follows. X' stores the multiplier ($x$). A', A and 
the first two rows of the 0-th layer of the array M' of size $((l + 2) \times l \times 2)$ store 
the multiplicands (the coefficients of $P(x)$); moreover, the even coefficients are 
stored in the 0-th layer of A'. In Fig. 3 the first pair $(x, x)$ to be multiplied is marked. 
In the array X' the least significant bit of $x$ is placed in the top cell; in the array 
M' the least significant bit of $x$ is placed in the rightmost cell of the 2-nd row. 

The 0-th layer of the array M of size $(l + 2) \times (l + 1) \times 2$ is intended for 
computing the products and the squares. Each result (a two-row code $(c, s)$) 
is obtained in the last two rows of the 0-th layer of M'. The two-row code $(c, s)$ 
of the square $x^{2^{k+1}}$ is transferred into the first two rows of the 1-st layer of M'. 
The two-row code $(c, s)$ of the product $A_{2m+1}^k(x)\, x^{2^k}$ is supplied number-serially 
to the 0-th layer of Ad1 (carry-save adder (CSA)). 

The 0-th layer of the array Ad1 of size $3 \times l \times (n/4 + 1)$ is used for computing 

$$A_m^{k+1}(x) = A_{2m+1}^k(x)\, x^{2^k} + A_{2m}^k(x). \qquad (3)$$

The coefficient $A_{2m}^k(x)$ is loaded from A' into Ad1 in advance. If $k = 0$, then 
$A_{2m}^0(x) = a_{2m}$ is placed in the 2-nd row of Ad1. If $k \ge 1$, then the two-row code of 
the coefficient $A_{2m}^k(x)$ is placed in the 1-st and the 2-nd rows of Ad1. 

So, the algorithm has formed the partial polynomial. If the obtained result is the 
first coefficient of a partial polynomial of the next level of recursion, then its 
code is placed in the first two rows of the 0-th layer of M'. Otherwise the obtained 
code is moved down into one of the layers of Ad1. Before that polynomial is 
computed, this code is returned to the 0-th layer of Ad1. 

The 1-st layer of the array M forms the one-row code of the square. The 
obtained result is loaded into X' digit-serially, starting from the 1-st digit. 

The result is accumulated in the form of the two-row code in the array Ad1 
and is then transferred into Ad2 (carry-look-ahead adder (CLA)) to sum the 
last two numbers. 

Data loading is performed under the control of the arrays $C_X$, $C_{M'}$ and $C_{A'}$. Data 
processing is performed under the control of $C_M$, $C_{Ad1}$, $C_{A'}$ and $C_{Ad2}$. 

The cellular algorithm consists of two procedures carried out successively: 

— computing the partial polynomials and the square, 

— transformation of the two-row code of the square. 

The procedure of computing the partial polynomials and the squares is based 
on the cellular-pipelined algorithm for computing a sum of products. 

The transformation procedure reduces to converting the two-row code $(c, s)$ of 
the square into its one-row code. It takes $(l + 1)$ steps. Each step consists of 
adding two Boolean integers ($c$ and $s$) and shifting the result (sum and carry) 
one digit to the right. Four rules ($\Theta_1$, $\Theta_2$, $\Theta_3$ and $\Theta_4$) and an example are 
given in Fig. 4. 
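
The add-and-shift idea behind the transformation can be sketched as below. The sketch does not reproduce the four substitution rules of Fig. 4; it assumes that each step applies a bitwise half-adder to the two rows and then shifts the frame one digit to the right, so that one digit of the one-row code leaves per step. The name `two_row_to_one_row` is hypothetical.

```python
def two_row_to_one_row(c, s, l):
    """Convert a two-row code (c, s) into the digits of c + s, least significant first.

    Assumes c, s >= 0 and c + s < 2**(l + 1), so that l + 1 steps suffice.
    """
    digits = []
    for _ in range(l + 1):
        total = c ^ s              # bitwise sum without carries
        carry = c & s              # carries, each worth twice its position
        digits.append(total & 1)   # the lowest digit is final, emit it
        s = total >> 1             # shift the sum one digit to the right
        c = carry                  # the shift cancels the doubled carry weight
    return digits
```

Because the digits appear serially and least significant first, they can be fed directly into a unit that expects a digit-serial multiplier, which is how the square is reused in the scheme described above.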

Fig. 4. Transformation procedure: a) the summation rules, b) an example of the 
transformation 2 → 1 in the 1-st layer of M 

3.3 Time Complexity 

The time complexity of the cellular algorithm is the following sum: 

$$T = t_{2^1} + \sum_{j=2}^{\lceil \log n \rceil - 1} t_{2^j} + t_p + t_l + t_{CLA}.$$

Here $t_{2^1}$ and $t_{2^j}$ are the times needed to compute the squares $x^2$ and $x^{2^j}$, 
$j = 2, 3, \ldots, \lceil \log n \rceil - 1$, respectively. $t_{2^1} = (l + 3)$ steps (2 steps are required to load 
the two-row code of $x^2$ into Ad1). $t_{2^j} = (l + 13)$ steps (5 steps are required to 
transfer the two-row code of the preceding square $x^{2^{j-1}}$ into the 1-st layer of M and 
to obtain the first digit of its one-row code, and $(l + 8)$ steps are needed to compute 
the two-row code of $x^{2^j}$). As the algorithm computes the leading polynomial of degree 
$2^j$ 8 steps after the computation of the square $x^{2^j}$, $j \ge 1$, we have $t_p = 8$. $t_l$ is the 
time needed to generate the last leading partial polynomial ($P(x)$), $t_l = (l + 11)$ steps. 
$t_{CLA}$ is the time required to transfer the result from the CSA into the CLA (2 steps) 
and to sum up the last two numbers ($\lceil \log l \rceil$ steps). So, the algorithm computes a 
polynomial in time 

$$T = (l + 13)\lceil \log n \rceil - 2 + \lceil \log l \rceil.$$
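
The arithmetic behind the closed form can be checked with a few lines of Python. The sketch assumes the component values read from the text above: $l + 3$ steps for the first square, $l + 13$ steps for each further square, 8 steps for $t_p$, $l + 11$ steps for the last leading polynomial (the value that makes the terms add up to the stated total), and $2 + \lceil \log l \rceil$ steps for the final addition. The function names are hypothetical and logarithms are taken to base 2.

```python
def ceil_log2(v):
    """Ceiling of log2(v) for an integer v >= 1, computed without floating point."""
    return (v - 1).bit_length()


def total_steps(n, l):
    """Sum the individual terms of the time complexity (assumes n >= 3)."""
    levels = ceil_log2(n)
    t_first_square = l + 3
    t_other_squares = (levels - 2) * (l + 13)   # terms for j = 2 .. levels - 1
    t_p = 8
    t_l = l + 11
    t_cla = 2 + ceil_log2(l)
    return t_first_square + t_other_squares + t_p + t_l + t_cla


def closed_form(n, l):
    """The closed-form expression for the same quantity."""
    return (l + 13) * ceil_log2(n) - 2 + ceil_log2(l)
```

For instance, a polynomial of degree 7 with 16-bit data gives 29 * 3 - 2 + 4 = 89 steps under these assumptions.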

4 Conclusion 

In this paper, we have presented a new cellular algorithm architecture for 
computing polynomials of degree $n$. The algorithm computes a polynomial in time 
$(l + 13)\lceil \log n \rceil + \lceil \log l \rceil - 2$. 
Abstract. This paper introduces MetaPL, a notation system designed to 
describe parallel programs both in the direct and in the reverse software 
engineering cycle. MetaPL is an XML-based Tag language, and exploits XML 
extension capabilities to describe programs written in different programming 
paradigms, interaction models and programming languages. The possibility to 
include timing information in the program description promotes the use of 
performance analysis during software development. After a description of the 
main features of the notation system, its use to obtain two particular program 
descriptions (views) is shown as an example. 



1 Introduction 

Currently the use of parallel hardware and parallel programs for solving computation- 
intensive problems is customary. Unfortunately, the development of parallel software 
is still carried out while systematically ignoring software engineering principles and 
methods. Most of the time, the resulting programs are badly structured, difficult to 
understand, and not easy to maintain or re-use. Sometimes, they even fail to meet the 
desired performance, which is the primary reason for resorting to parallelism. The 
first cause of this state of affairs is the absence of a unifying programming approach. 
Traditional sequential software developers have a clear idea of how program 
statements will be executed, and can pay attention to software quality. Parallel 
programmers are instead mobilized for the holy war between the supporters of shared 
memory and message-passing paradigms. 

A second complementary issue is the relatively scarce interest taken by software 
engineering researchers in performance issues. The traditional targets of software 
engineering are functional requirements and how to build software that has few bugs 
and can be easily maintained. In fact, performance-oriented techniques have rarely 
been adopted throughout the software life cycle. It should also be noted that the 
scenario has changed considerably, due to the more intensive use of structured software 
design approaches, such as component-based development. Currently there is 
widespread interest in techniques that allow the design of software for performance. 
It is worth pointing out that, oddly enough, performance problems have not been 
eliminated, nor even mitigated, by the availability of cheap and fast hardware. The 
newness of software environments and hardware, along with the consequent inexperience of 
developers, increases the risk of performance failures. This is particularly true for 
parallel systems, where hidden, non-intuitive and inter-related performance factors, 
computer environment heterogeneity and complexity, and the wide range of possible 
design choices make it very difficult to obtain satisfactory performance levels. 

Software performance engineering (SPE) methods, successfully used for years in 
the context of sequential software, are the obvious solution for the development of 
responsive parallel software systems. The SPE process begins early in the software 
life cycle, and uses quantitative methods to identify among the possible development 
choices the designs that are more likely to be satisfactory in terms of performance. 
Stated another way, SPE makes it possible to discriminate between successful and 
unsuccessful (as far as performance is concerned) designs, before significant time and 
effort is invested in detailed design, coding, testing and benchmarking [1], [2], [3], 
[4], [5]. Starting from the performance problems and the need to study them at the 
early stages of development, new software life-cycles, graphical software views and 
CASE tools were developed, oriented to the development of parallel software [4], [5], 
[6], [7]. Unfortunately, the research efforts in this field have been carried out in many 
different directions in a completely uncoordinated way, thus producing incompatible 
tools based on alternative approaches. 

It is interesting to point out that for sequential software, and in particular for 
Object-Oriented Programming, there are instead many widely accepted standards, 
such as UML, as graphical notation systems. New standards are under development, 
such as XMI [8] for data interchange between CASE tools. Tools based on these 
standards, initially developed for traditional sequential programming, are beginning to 
be used in distributed systems, and in particular for distributed object systems like 
CORBA. However, they are not easily reusable for general parallel programming, as 
the existence of many non-standard UML extensions for parallel programming clearly 
shows. 

Our research group has been active for several years in the field of heterogeneous 
distributed system simulation for performance analysis [1], [9], [10], [11]. The 
development of parallel applications on top of a high-performance simulator, running 
on a workstation or on a scaled-down distributed environment rather than directly on 
the target computing system, has proven to be a simple and profitable solution, able to 
improve software quality and to reduce development costs, even in the absence of 
integration with customary software engineering methodologies. 

Probably the most interesting possibility offered by software development in 
simulation environments is the iterative refinement and performance evaluation of 
program prototypes. Prototypes are incomplete program designs, skeletons of code 
where (some of) the computations interleaved between concurrent process 
interactions are not fully specified. In a prototype, these local¹ computations are 
represented for simulation purposes by delays equal to the (expected) time that will be 
spent in the actual code. The use of prototypes has shown that this synthetic way of 
describing the behavior of a parallel program is very powerful: it is language- and 
platform-independent, shows only the essential features of software, and can be used 
for performance analysis at the early development stages. 

¹ By “local” computation, we mean a sequence of executed statements that entail no 
interaction (whether by shared memory or by message-exchange) with other concurrent 
processes or threads. 
In this paper we describe MetaPL, a notation system designed to be the evolution 
of the concept of prototypes. MetaPL is an XML-based Tag language, and provides 
predefined elements for the description of parallel programs at different levels of 
detail. Using the extension mechanisms peculiar to XML, the capabilities of the notation 
system can be freely expanded whenever necessary. In fact, MetaPL is composed of a 
minimal “core” language; through the introduction of suitable extensions, it is 
possible for the software developer to describe virtually any parallel/distributed 
software system. The extension mechanism is also used to obtain particular views of 
the developed software, such as graphical representations, program documentation or 
program activity traces to be used as input for trace-based simulators. The main 
features of MetaPL are the following: 

— flexibility, since it may be used in conjunction with any type of programming 
language or parallel programming paradigm; 

— completeness, as every single line of code contained in the source code program 
can be represented, if necessary. Conversely, the source code can be easily 
recovered from the description; 

— simplicity, as it supports code examination and understanding through graphical 
views, which define simple transformations that can be used as input for graphical 
tools, or to generate human-readable code documentation; 

— suitability for performance evaluation: the program description allows the insertion 
of information on response times of portions of code, thus promoting the 
integration with performance evaluation tools. 

The use of a single, flexible notation system may help the development of CASE 
tools and data interchange between them. Furthermore, its suitability for simulation 
promotes the use of performance analysis techniques in the early stages of the 
software development cycle. 

This paper is structured as follows. In the next section the key concepts of the 
language are introduced, along with a description of its structure and of the main 
solutions adopted. The core of the notation is presented, introducing the extension 
system and a simple message-passing extension. Then the concept of views and two 
examples of their use are dealt with. Finally, the conclusions are drawn. 



2 Meta-language Overview 

Owing to the great research efforts of the last decades in the field of parallel and 
distributed programming, there is currently a very high number of imperative 
parallel/concurrent programming languages, based on alternative memory models 
(shared-memory, message-passing, hybrid solutions). They allow (with some 
restrictions, of course) different kinds of interaction models to be exploited (client- 
server, peer-to-peer, master-slave, processor farm, ...). All the above can be targeted 
to radically different target hardware/software systems, exploiting run-time libraries 
in stand-alone computers, run-time libraries or O.S. calls on the top of communication 
protocol software in networked systems, or even resorting to specialized 
communication hardware. 

In this paper we describe a single unifying notation whose objective is to (try to) 
tame this complexity, assisting the software developer in: 
— the development of high-performance parallel software from scratch by intensive 
use of prototypes and by performance prediction techniques integrated with 
software development tools (direct parallel software engineering, DPSE); 

— the examination, comprehension, refinement and performance improvement of 
fully-developed programs by simplified program views such as diagrams, 
animations, simulations (reverse parallel software engineering, RPSE). 

In DPSE, our development procedure requires the description of the basic structure 
of the algorithm through prototypes. These are static flow graphs made up of nodes 
corresponding to blocks of sequential code (i.e., sequences of executable statements 
involving no interaction with other tasks), and nodes corresponding to a basic set of 
parallel programming primitives (parallel activation and termination, wait for child 
task termination, communication, synchronization, ...). Once the blocks of sequential 
code have been annotated with execution time estimates, found by direct 
measurement on available code or by speculative benchmarking, it is possible to 
evaluate the time required for task interaction by simulation tools [11] or analytic 
models. Predicting the final performance, even at the very early stages of software 
development, makes it possible to adopt a cyclic software evolution technique. The 
(predicted) performance is validated against problem specifications. If the results are 
not satisfactory, it is necessary to revise (some of) the choices made in the previous 
development steps. Otherwise, the prototypes are refined, replacing nodes with sub- 
graphs or even with real code. Performance is validated once again, the design is 
further detailed, and so on. When the process stops, the original prototypes have been 
replaced with a fully-developed code compatible with the initial performance 
objectives. The whole process, represented graphically for a simple two-task program 
in Fig. 1 (left to right), requires a language- and paradigm-independent notation, made 
up of a simple set of primitives for expressing explicit parallelism, plus the ability to 
encapsulate opaque code or blocks of sequential statements. 

In RPSE, instead, a fully-developed program is to be represented as a prototype. 
This requires the construction of a static flow graph made up of nodes corresponding 
to concurrent programming constructs and nodes corresponding to sections of 
sequential code involving no interaction with other tasks. It should be noted that this 
requires some knowledge about the sequential programming language adopted, as 
well as the full range of concurrent constructs exploited. An issue of paramount 
importance for the developer is the ability to recover actual code from the 
prototype-like notation and vice versa. The RPSE process is also shown in Fig. 1, 
following the arrows from right to left. 

The main issue dealt with in this paper is the design of a notation able to support 
both direct and reverse development cycles. In light of the above, these have different 
representation requirements. In DPSE, it is important to manage hierarchical 
structures of code blocks, which are to be progressively detailed. In RPSE, instead, 
complete program codes have to be suitably handled. It should be explicitly pointed 
out that our proposal, MetaPL, is decidedly not a parallel programming language (as 
mentioned earlier, there are plenty of parallel languages), but just a simple and concise 
notation to support forward and reverse development cycles. It should also be clear 
that it is not possible to devise a notation able to support the totality of parallel 
programming languages and notations. The basic assumption made here is that the 
program is written in a conventional imperative sequential language (maybe an 
object-oriented one), extended with commands/functions for performing the basic 
tasks linked to parallel programming (e.g., activation and termination of tasks, 
message-passing and/or shared-memory synchronization, ...). Each task runs until 
completion, executing its statements in a sequential fashion, communicating and 
synchronizing with other tasks on a single processor, or on multiple processors 
communicating by a network, a shared memory or a combination of the two. This is 
not a particularly restrictive assumption, as the majority of existing parallel/ 
distributed programs satisfies these requirements. 



Fig. 1. The direct and reverse parallel software development cycles 

As briefly mentioned in the introduction, the main characteristic of the notation 
system, namely its flexibility, has been obtained by defining not a single notation, but 
an extensible language. MetaPL is composed of a core, a minimal set of language- 
and paradigm-independent commands for expressing high level prototypes of 
concurrent computations, plus language-, model- and paradigm-based extensions that 
can be added to the core notation, thus making it possible to represent real code at any 
level of detail in prototype-like form. As a matter of fact, extensibility is one of the 
key characteristics of XML [12]. In XML, a Document Type Definition (DTD) 
defines the Tags of the language and their composition rules. Hence it has been a 
natural choice to develop MetaPL as a collection of DTDs for XML. A further 
advantage linked to the use of XML is the possibility to use existing tools (e.g., 
Xerces, libxml, IE, Quilt, Swig) for parsing, querying (extracting information) or 
editing the XML documents describing a parallel program. 

In the subsequent language description we will define XML elements,² their 
attributes and composition rules, explaining how they can represent parallel 
programming concepts. It is also worth pointing out that, unless otherwise noted, we 
will consider for expositive convenience only the direct cycle, showing the use of 
MetaPL in the DPSE context. However, the proposed notation can support RPSE as well. 

² An XML document is composed of elements, the boundaries of which are either delimited by 
start-tags and end-tags, or, for empty elements, by an empty-element tag. 



3 The Language Core 

The MetaPL core has a very limited description capability. This choice was made 
purposely, in order to avoid imposing a unifying programming approach, which would 
make it impossible to describe alternative programming paradigms and models. The 
objective of the core notation is to describe in a simplified way only the high-level 
structure of the parallel code (task activation and termination, plus minimal 
synchronization statements) and generic sequential code. In fact, parallel programs 
are made up of sequential code augmented with concurrency, communications and 
synchronization constructs typical of the programming paradigm adopted. Hence, 
facilities for describing sequential code are at the base of every parallel program 
description. 

A MetaPL description is essentially a hierarchical structure whose building blocks 
are different types of blocks of code. All blocks encapsulate sections of code, and 
may have attributes associated with them, such as the actual code contained, the 
expected execution time or the name of a cost function, which gives the (expected) 
execution time as a function of program inputs. It should be noted that the description 
by blocks encapsulating hierarchically the code at different levels of detail is indeed 
useful for prototype code in the direct cycle. It is less advantageous in RPSE, when 
complete programs have to be modeled. In the latter case, a complete description (in 
that every line of code contained in the source code program can be represented) can 
be obtained by exploiting the concept of “vagueness”: all the source code statements 
that are not directly supported by the notation system can be marked as 
GenericCommand, keeping the original code in the notation system. Whenever a 
new extension is applied to the description, the unknown commands are analyzed and, 
if recognized, replaced with a more detailed description. 

The basic type of block is the CodeBlock. By definition a CodeBlock is an 
opaque object, made up internally of a sequence of executable statements written in a 
conventional (sequential) programming language. A parallel program can be 
described as a set of CodeBlocks, along with commands that describe the high- 
level structure of the code. These commands³ can either be concurrent or sequential, 
depending on whether they involve some form of interaction among concurrent tasks, 
or not. Sequences of CodeBlocks and sequential commands can be combined into 
SequentialBlocks. In their turn, SequentialBlocks and concurrent 
commands compose the Block, which is the basic program unit assigned to a 
processor, and is executed as a separate task. 

³ As the word “commands” may be misleading, it is worth pointing out that in this context a 
command is not an executable statement, as MetaPL is just a notation for describing a 
computation, and not a programming language. However, a command corresponds to 
executable statements (or sequences of executable statements) in the described code. 
3.1 Variables 

Even though MetaPL is not a programming language, and so there is no need for data 
management, the introduction of variables is of great help in describing an algorithm. 
Furthermore, it is useful for evaluating its performance in the presence of variable input 
data (e.g., the problem dimension). For these reasons, a MetaPL description may 
include Variable elements. They are identified by the name attribute, may contain 
an initial value attribute, and a further attribute describing the type. Instead of 
a constant initial value, it is possible to use the predefined value ASK. In this case, it is 
up to the user to supply the value interactively, whenever the value is actually 
required (typically, when a view is built from the program description). It is worth 
pointing out that MetaPL variables are meta-variables, i.e., they have only descriptive 
validity, and should never be confused with any variables possibly present in the 
encapsulated code. 



3.2 Sequential Commands: Loop and Switch 

The possibility to describe programs that have alternative paths (even at high level), 
or that perform an activity a number of times that depends on user input, is the 
primary reason for the inclusion in the MetaPL core of conventional sequential 
commands such as loop and switch. 

The Loop available in MetaPL is a for loop executed a known number of times. 
The attributes of a Loop include the name of the loop control variable and the 
number of iterations; optional attributes (start, end, step) can be used to vary 
the values of the control variable in a more complex way. It should be noted that the 
coherence of the attribute values is not verified in the description system. For 
example, a loop to be repeated 8 times, controlled by a variable with start value 1, end 
value 5, step 1, is well-formed, even though it is not semantically correct. 
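
Since the notation itself does not verify coherence, a tool built on top of it could do so. The following is a minimal sketch of such a check, not part of MetaPL; it assumes that the end value is inclusive and that a missing step defaults to 1, and the function name `loop_is_coherent` is hypothetical.

```python
def loop_is_coherent(iterations, start=None, end=None, step=None):
    """Return True if the iteration count agrees with start, end and step.

    Assumptions of this sketch: the end value is inclusive, a missing step
    means 1, and a loop without start or end cannot be contradicted.
    """
    if start is None or end is None:
        return True
    if step is None:
        step = 1
    if step == 0:
        return False
    # number of values start, start + step, ... that do not pass end
    implied = max(0, (end - start) // step + 1)
    return implied == iterations
```

Applied to the example above (8 iterations, start 1, end 5, step 1), the check reports an inconsistency, because the control variable can take only 5 values.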

The following is a simple example that shows the use of Loop: 

<Code><Loop variable="i" iterations="1"> 

<CodeBlock type="opaque"> <description> Get the i-th element 
of the first vector, get the i-th element of the second 
vector and sum them </description></CodeBlock> 

</Loop></Code> 

The next example shows the use of cost functions and how the loop can be hidden 
in an opaque CodeBlock. 

<Code> 

<CodeBlock type="opaque" costfunction="VectorMul"> 
<description> Multiply two vectors and store the result in a 
single variable. The dimension is n </description> 

</CodeBlock> 

</Code> 

<CostFunction name="VectorMul"> The time spent is 0.15ms 
</CostFunction> 
The Switch command enables the description of statements such as if-then-else. 
It contains a sequence of Case elements, each of which contains the action to be 
performed. 

The Case element has two attributes: 

- prob, the probability that the option of the switch is selected; 

- condition, that describes the condition leading to the selection of a switch 
option. 

The two attributes may be used together, as shown in the following example. 
<Switch> 

<Case prob="5%" condition="condition description"> 

<CodeBlock> Do something </CodeBlock> </Case> 

<Case prob="95%" condition="condition description"> 

<CodeBlock> Do something else </CodeBlock> </Case> 

</Switch> 
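
One way a performance-oriented tool might use the prob attribute is to weight the cost of each branch. The sketch below is an illustration only: it assumes that each CodeBlock carries a time attribute and that probabilities are written as percentages; the time values and the function name `expected_time` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical description: the time values are invented for illustration.
SWITCH = """
<Switch>
  <Case prob="5%" condition="rare branch">
    <CodeBlock time="2.0"> Do something </CodeBlock> </Case>
  <Case prob="95%" condition="common branch">
    <CodeBlock time="0.2"> Do something else </CodeBlock> </Case>
</Switch>
"""


def expected_time(switch_xml):
    """Probability-weighted execution time of a Switch description."""
    root = ET.fromstring(switch_xml)
    total = 0.0
    for case in root.findall("Case"):
        # "5%" -> 0.05
        prob = float(case.get("prob").rstrip("%")) / 100.0
        # sum the annotated times of all blocks inside this branch
        branch = sum(float(block.get("time", "0"))
                     for block in case.iter("CodeBlock"))
        total += prob * branch
    return total
```

With the invented values above, the expected time is 0.05 * 2.0 + 0.95 * 0.2 = 0.29 time units.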



3.3 Concurrent Commands: Spawn, Exit and Wait 

Concurrent commands are required to introduce into MetaPL the concept of a 
program composed of many concurrent tasks.⁴ A parallel program and its task 
structure are described as in the following example: 

<ParallelProgram> 

<TaskDecl name="father" id="1"> This is the father task</TaskDecl> 

<TaskDecl name="son" id="2"> This is the son task</TaskDecl> 

</ParallelProgram> 

The task code description is given in the Task element, characterized by the name 
attribute; the code is composed of the MetaPL core statements introduced above, plus 
the concurrent commands shown below. 

The Spawn command is characterized by two attributes, the name of the spawned 
task, defined elsewhere along with the enclosed code, and its identifier. 

The Wait statement is used for the description of synchronization on task exit: the 
task that contains this element waits for the termination of the task whose id is 
reported in the waitforid attribute, or for the termination of the number of 
spawned tasks reported in the attribute number. The two attributes can never be 
present at the same time. 

The Exit statement indicates the end of a task and is used to terminate the 
corresponding Wait. 

The following example shows two tasks, synchronized by means of a Wait 
command. 



⁴ A “concurrent task” is the basic schedulable unit that can be assigned to a processor. In 
practice, it can be implemented by a process or by a thread, depending on the execution 
environment. 
<Task name="master"><Code>

<Spawn spawnedname="son" spawnedid="2"></Spawn>

<CodeBlock time="0.1"> Do something and close </CodeBlock>
<Wait waitforid="2" />

<Exit />

</Code></Task>

<Task name="son"><Code>

<CodeBlock time="0.3"> Do something and close </CodeBlock>
<Exit />

</Code></Task>
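To make the synchronization semantics concrete, the following Python sketch estimates when each task of the example terminates. It is a rough illustration, not part of MetaPL: it assumes that spawning costs no time, that every task runs on its own processor, and that only the waitforid form of Wait is used. The enclosing Tasks element and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# The two tasks of the example, wrapped in a hypothetical <Tasks> root so
# that the fragment is a single well-formed XML document.
PROGRAM_XML = """
<Tasks>
  <Task name="master"><Code>
    <Spawn spawnedname="son" spawnedid="2"></Spawn>
    <CodeBlock time="0.1"> Do something and close </CodeBlock>
    <Wait waitforid="2" />
    <Exit />
  </Code></Task>
  <Task name="son"><Code>
    <CodeBlock time="0.3"> Do something and close </CodeBlock>
    <Exit />
  </Code></Task>
</Tasks>
"""


def finish_times(xml_text, root_task="master"):
    """Return a dict mapping each task name to its estimated finish time."""
    root = ET.fromstring(xml_text)
    tasks = {t.get("name"): t.find("Code") for t in root.findall("Task")}
    finished = {}    # task name -> finish time
    id_to_name = {}  # spawned id -> task name

    def run(name, start):
        clock = start
        for stmt in tasks[name]:
            if stmt.tag == "CodeBlock":
                clock += float(stmt.get("time", "0"))
            elif stmt.tag == "Spawn":
                child = stmt.get("spawnedname")
                id_to_name[stmt.get("spawnedid")] = child
                run(child, clock)  # the child starts at spawn time
            elif stmt.tag == "Wait":
                child = id_to_name[stmt.get("waitforid")]
                clock = max(clock, finished[child])
            elif stmt.tag == "Exit":
                break
        finished[name] = clock
        return clock

    run(root_task, 0.0)
    return finished
```

Under these assumptions the master completes its own 0.1 time units of work first and is then blocked in the Wait until the son exits at time 0.3.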



4 The Language Extensions 

The notation system introduced so far is equivalent to other existing program
description languages. However, MetaPL gains considerable flexibility and expressive
power through the use of extensions. Extensions enable the notation system to
describe programs that contain notations that cannot be converted into the ones
defined in the MetaPL language core.

The extensions to the language can be divided into two major classes: 

- language extensions, which expand the description capability of the core language,
adding new commands typical of new programming paradigms or specific to a
library or a programming language;

- filters, which define the rules that can be used to obtain less complete and
structured, though simpler, representations of the software (views). Typical
program views are diagrams, traces or prototypes for simulation.

There is a wide range of language extensions (some already developed, others only
planned) that allow the most widely used parallel and concurrent languages and
notations to be described in MetaPL. In this Section, we will present as an example
the Message Passing Extension (MPE).

Figure 2 shows the relationship between a document describing a parallel program,
the language extension used, which defines further programming concepts typical of
the adopted programming paradigm, and the filters used to obtain traces for
simulation and an HTML program view.



4.1 The Message-Passing Extension (MPE) 

The MPE extends the core set of MetaPL commands with the introduction of non- 
blocking send (Send) and blocking receive (Receive). These basic commands are 
sufficient to describe the majority of message-passing task interactions. It should be
explicitly noted that, as far as program description is concerned, information on the
contents of exchanged messages is useless, unless it influences the number of times a
loop is performed, or the choice of one switch branch or another. However, this is not
possible in MetaPL, owing to the (rather restrictive) way in which loops and switches
are defined.
Fig. 2. A parallel program description and MetaPL extensions

The Send element is characterized by three attributes: 

- receiver, the id of the task that has to receive the message. This attribute is 
mandatory; 

- dim, the dimension in bytes of the message. This attribute is not mandatory, but it 
is useful for performance evaluation purposes; 

- msgtag, which can be used to identify a specific message. 

An example of Send element is the following: 

<Send receiver="2" msgtag="1" dim="10"></Send>

The Receive element, instead, has only the attributes sender and msgtag,
neither of which is mandatory. If a sender is not specified, the receiver can get the
message from any sender, as in the following example:

<Receive></Receive>

Similarly, the absence of a msgtag attribute is representative of a situation in which 
the receiver is willing to accept any type of message. 
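The wildcard behaviour of Receive can be summarized by a small matching rule. The Python sketch below is only an illustration of the semantics described above; the Message and Receive classes and the matches function are hypothetical names introduced for the example and do not belong to MetaPL.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    """A message as described by a Send element."""
    sender: str                   # id of the sending task
    receiver: str                 # mandatory attribute of Send
    msgtag: Optional[str] = None  # optional tag
    dim: Optional[int] = None     # optional size in bytes


@dataclass
class Receive:
    """A Receive element; a missing attribute acts as a wildcard."""
    sender: Optional[str] = None
    msgtag: Optional[str] = None


def matches(recv, msg, my_id):
    """Return True if task my_id can consume msg with this Receive."""
    if msg.receiver != my_id:
        return False
    if recv.sender is not None and recv.sender != msg.sender:
        return False
    if recv.msgtag is not None and recv.msgtag != msg.msgtag:
        return False
    return True
```

For instance, a Receive without attributes accepts any message addressed to the task, whereas a Receive with sender="1" rejects messages coming from other tasks.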



5 The Views 

By exploiting suitably-defined extensions, MetaPL can describe most parallel and 
distributed programs. However, even if the number of defined elements is not high, 
the resulting description can be neither concise, nor particularly easy to manage, 
because of XML language redundancy. As mentioned in the language overview, the 
views are descriptions less complete and structured than the MetaPL one, but able to 
highlight specific aspects of the program. For example, they could be UML sequence 
diagrams, portions of code, automatically-generated documentation of the program, 
prototypes or program traces useful for simulation purposes [11]. The derivation of 
views from the original MetaPL description is performed by filters, which formally
are extensions to the notation system.^ We will present here the filters that can be used 
to derive two views, useful as HTML documentation and as input for HeSSE 
simulation, respectively. 

The filters are made up of a description of the target document format (this 
description typically is a DTD), and of a non-empty set of additional documents 
(translators). The target format description may be absent if a standard output format 
(e.g., HTML) is used. The translators basically contain conversion rules from MetaPL 
to a new format and (possibly) vice versa, but are also used to produce additional
support documents, such as translation logs. Formally, the translators are XSLT
(XSL Transformations) documents [13]. XSLT is an XML-based language; XSLT 
well-formed documents define a transformation that, given a well-formed XML 
document, generates a new text document. The target could be a “traditional” format, 
such as HTML or XMI, or a brand new one. 

In some cases, in order to perform the defined transformations it may be necessary 
to supply additional information. This problem can be dealt with in two different 
ways. A first possibility is simply to report the absence of the data needed for 
translation in the output document. Later on, the user can supply them, and repeat the 
transformation process. Alternatively, the user can be asked for the additional
information at translation time.

The same technique used to extend the core language of MetaPL is also used for 
the filters. The core filter is able to handle only the notation defined in the MetaPL 
core. It may be suitably extended by filter extensions, which allow the conversion of 
elements not belonging to the core set. 



5.1 The MetaPL-HTML Filter 

The MetaPL-HTML filter produces a simple hypertextual view of the program, which 
highlights the computational steps carried out, and enables the developer to navigate 
through his/her code. Since the filter output format is HTML (a well-known format),
the filter is made up only of an XSLT document (the translator) that defines the 
transformation from the MetaPL description to the HTML document. 

The generated HTML page is very simple. It starts with the declaration contained 
in the MetaPL element ParallelProgram, that is, with a marked list of the 
declared task names, each of which is a link to the task code description. In particular, 
the first XSLT definition in the translator associates the MetaPL XML document 
heading with the heading of the HTML document. The ParallelProgram 
MetaPL tag is associated with a first level heading (H1) whose content is the element
attribute name. The Task element is substituted with an HTML second level 
heading, whose content is “Task name id”. The id attributes of the Task and 
ProcessDecl elements are used as names for an HTML bookmark to the heading. 
Each element defined in the language core is reported with the name in bold; its 
contents are provided in the form of an HTML paragraph. 



^ They are called here simply “filters” and not “filter extensions” because they are in their turn 
extensible. Hence by “filter extension” we will mean an extension to a filter. 
Message-passing extensions to the filter make it also possible to handle Send and
Receive commands, reporting them in bold font with hyperlinks to the bookmarks 
whose names are given by the sender and receiver attributes. 
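The mapping performed by the core translator can be imitated in a few lines of Python. The sketch below is not the XSLT document used by the filter; it only mirrors the rules listed above (H1 for the program name, a list of links for the declared tasks, H2 headings with bookmarks for the tasks, element names in bold). The MetaPL root element, the name attribute value and the function name are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

# A small description; the <MetaPL> root element is hypothetical.
DOC = """
<MetaPL>
  <ParallelProgram name="demo">
    <TaskDecl name="father" id="1">This is the father task</TaskDecl>
    <TaskDecl name="son" id="2">This is the son task</TaskDecl>
  </ParallelProgram>
  <Task name="father" id="1"><Code>
    <CodeBlock time="0.1">Do something and close</CodeBlock>
    <Exit />
  </Code></Task>
</MetaPL>
"""


def to_html(xml_text):
    """Return an HTML fragment following the rules of the core translator."""
    root = ET.fromstring(xml_text)
    out = []
    program = root.find("ParallelProgram")
    # First level heading with the program name.
    out.append("<h1>%s</h1>" % escape(program.get("name", "")))
    # Marked list of declared tasks, each linking to its code description.
    out.append("<ul>")
    for decl in program.findall("TaskDecl"):
        out.append('<li><a href="#%s">%s</a></li>'
                   % (escape(decl.get("id")), escape(decl.get("name"))))
    out.append("</ul>")
    # One second level heading per task; the id is used as bookmark name.
    for task in root.findall("Task"):
        tid, name = task.get("id"), task.get("name")
        out.append('<h2><a name="%s">Task %s %s</a></h2>'
                   % (escape(tid), escape(name), escape(tid)))
        for stmt in task.find("Code"):
            body = (stmt.text or "").strip()
            out.append("<p><b>%s</b> %s</p>" % (stmt.tag, escape(body)))
    return "\n".join(out)


page = to_html(DOC)
```

An XSLT translator expresses the same rules declaratively as templates matching the MetaPL elements; the procedural version here is used only because it is short and easy to run.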



5.2 The Simulation Filter 

As mentioned in the introduction, one of the primary design objectives of the MetaPL 
notation system is to promote the use of performance prediction tools at the early 
stages of software development. The simulation filter briefly described here can 
automatically perform the generation of traces that can be used to drive the HeSSE 
simulator [11] in order to obtain predictions of program performance on a given (real 
or fictitious) distributed heterogeneous computing platform. 

The output format of the Simulation filter is the trace format accepted as input by 
the HeSSE Simulator. Obviously, this format is textual, not an XML document. 
HeSSE traces are sequences of events corresponding to the basic actions that the 
system can simulate, such as CPU bursts (Workload event) or message-passing calls
(Send or Receive). Formally, each event is a HeSSE System Call, and may have 
parameters (e.g., the receiver for a Send event). Each trace starts with a control 
symbol sequence (“PST”). 

The filter contains two translators. The first XSLT, the MetaPL-HeSSE translator,
is used to generate the trace input for simulation. The second one, Simulation-checker,
is instead used to check if all information needed to simulate a program description is 
available. In fact, sometimes some values required for program simulation cannot be 
obtained by the code description and have to be supplied by the user. The translator 
generates an output log file wherein all the encountered problems are reported. The 
suggested filtering procedure is to test the description through the Simulation-checker 
first, to analyze the output log and, finally, to generate the output traces. 

5.2.1 The MetaPL-HeSSE Translator 

The MetaPL-HeSSE translator produces HeSSE trace files. It assumes that the 
MetaPL description can be translated; otherwise, the result may be inconsistent. The 
goal is to produce usable HeSSE trace files; the translator does not handle their 
subdivision according to the system configuration (HeSSE needs a separate file for 
each simulated task). Hence the XSLT simply generates a single text document, 
which can be subsequently subdivided since each task trace is marked by the control
symbol sequence PST. 

Just to give the reader the flavor of the translation carried out, we will informally 
mention the transformations made under the assumption that Variable, Switch 
and CostFunction elements are not used, and that loops are not nested: 

• each CodeBlock is replaced with a Workload event and the content of its 
time attribute; 

• each Loop is unrolled: its content is analyzed a number of times equal to the value 
of the iteration attribute; 

• each Spawn and Wait is replaced with the corresponding HeSSE System Call, 
giving as parameter their attributes; 

• the Exit is replaced with the equivalent HeSSE System Call. 
In addition, if the MetaPL-HeSSE filter extension handling message-passing is used:

• for each Send (Receive) in the MetaPL description, a HeSSE Send (Receive)
event is written to the output file. The sender, dim, msgtag attributes are
reported as parameters of the HeSSE System Call.
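The rules listed above can be sketched in Python as follows. This is an illustration of the translation logic only: the textual layout of the events ("Workload 0.1", "Spawn son 2", and so on) is a hypothetical format chosen for the example and is not the actual HeSSE trace syntax, and the Program root element and function names are likewise assumptions.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<Program>
  <Task name="master"><Code>
    <Spawn spawnedname="son" spawnedid="2"></Spawn>
    <Loop iteration="2">
      <CodeBlock time="0.1">work</CodeBlock>
    </Loop>
    <Wait waitforid="2" />
    <Exit />
  </Code></Task>
</Program>
"""


def trace_lines(code):
    """Translate the statements inside a Code (or Loop) element."""
    lines = []
    for stmt in code:
        if stmt.tag == "CodeBlock":
            # A CPU burst: Workload event plus the time attribute.
            lines.append("Workload %s" % stmt.get("time"))
        elif stmt.tag == "Loop":
            # Unroll: analyze the body 'iteration' times.
            for _ in range(int(stmt.get("iteration"))):
                lines.extend(trace_lines(stmt))
        elif stmt.tag in ("Spawn", "Wait", "Send", "Receive"):
            # System call with the element attributes as parameters.
            params = " ".join(stmt.attrib.values())
            lines.append((stmt.tag + " " + params).strip())
        elif stmt.tag == "Exit":
            lines.append("Exit")
    return lines


def to_trace(xml_text):
    """Return a single text document; each task trace starts with PST."""
    root = ET.fromstring(xml_text)
    out = []
    for task in root.iter("Task"):
        out.append("PST")
        out.extend(trace_lines(task.find("Code")))
    return "\n".join(out)
```

Because every task trace begins with the PST marker, the single output document can later be split into one file per simulated task.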

5.2.2 The Simulation-Checker Translator

The output of this XSLT is a text file showing all the additional information that has
to be supplied in order to be able to simulate the description. Even if simulation is
possible and no further input is necessary, the output file may be non-empty, and 
contain additional information useful for simulation purposes. 

In practice, the Simulation-Checker translator checks that all the attribute values 
needed in the translation to HeSSE simulation traces can be obtained from values of 
MetaPL variables in scope. A typical example of missing value is the number of 
iterations of a loop where the final value of the control variable is supplied only at 
run-time. In this case, the MetaPL description is (necessarily) incomplete as far as 
simulation is concerned, and the value has to be supplied during the trace generation 
process. 
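A much simplified version of this check can be sketched in Python. The sketch only tests whether the attributes needed for trace generation are literally present; it does not resolve values through MetaPL variables in scope, as the real translator is described to do. The table of required attributes, the log message format and the Program root element are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Attributes assumed to be needed for trace generation (illustrative).
REQUIRED = {
    "CodeBlock": ["time"],
    "Loop": ["iteration"],
    "Send": ["receiver"],
}

INCOMPLETE = """
<Program>
  <Task name="master"><Code>
    <Loop>
      <CodeBlock time="0.1">work</CodeBlock>
    </Loop>
    <CodeBlock>unknown cost</CodeBlock>
    <Exit />
  </Code></Task>
</Program>
"""


def check(xml_text):
    """Return a log listing the values that the user still has to supply."""
    root = ET.fromstring(xml_text)
    log = []
    for task in root.iter("Task"):
        for elem in task.iter():
            for attr in REQUIRED.get(elem.tag, []):
                if elem.get(attr) is None:
                    log.append("Task %s: %s is missing attribute '%s'"
                               % (task.get("name"), elem.tag, attr))
    return log
```

An empty log then means that, under these simplified rules, the description can be passed on to trace generation.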

6 Conclusions 

This paper has introduced MetaPL, a notation system designed to describe parallel 
programs both in the direct and in the reverse software engineering cycle. The main 
features of MetaPL are: 

- flexibility, since it may be used in conjunction with any type of programming 
language or parallel programming paradigm; 

- completeness, as every single line of code contained in the source code program 
can be represented, if necessary. Conversely, the source code can be easily 
recovered from the description; 

- simplicity, as it supports code examination and understanding through views, which
can be used as input for graphical tools, to generate human-readable code 
documentation, or as input for trace-driven simulations; 

- suitability for performance evaluation: the program description allows the insertion 
of information on the response times of portions of code, thus promoting the 
integration with performance evaluation tools. 

These characteristics have been obtained by heavily exploiting XML extension
capabilities. In fact, MetaPL can describe program code based on different memory
models, interaction structures and programming languages. The possibility to include 
timing information in the program description makes it possible to analyze 
performance during software development. 

After a description of the main features of the notation system, we have shown its 
use to obtain two completely different kinds of views as an example. The first one is 
an automatically generated documentation on the Parallel Program in HTML, which 
enables the navigation of the code description by a simple web browser. The second 
view enables the performance evaluation of a parallel program from the MetaPL 
description, producing a set of traces that can be used as input for the HeSSE
simulator environment. 
Abstract. In this paper first order 2D cellular neural networks (CNN’s)
with homogeneous weight structure are investigated. It is proved that all
CNN’s are divided into equivalence classes with respect to the properties of
the patterns formed. A method of learning for first order CNN is proposed, which
makes it possible to find the parameters of the CNN weight template if an example
of a stable state is given.



1 Introduction 

Many natural phenomena, including some reaction-diffusion processes of
dissipative structure formation, have no adequate mathematical models to date.
This is true, for example, of some chemical reactions, crystallization processes,
natural areas of biological species, ecological phenomena and so on. New models
of spatial dynamics are therefore under active investigation. They should not
only explain the existing phenomena qualitatively but also be suitable for
technological implementation on modern parallel computing systems. In other
words, they should have the properties of fine-grained parallelism with local
interconnections. So investigations of the simulative properties of cellular
automata (e.g., [1]) and CNN [2,3] are of great interest.

A CNN consists of locally interconnected elements (named cells). The connections
are weighted, and each cell computes its output(s) as a nonlinear function of its
internal state(s). In a discrete-time CNN all cells calculate their next states
in parallel, i.e. iteratively and synchronously. The computation starts when all
cells are set in an initial state, and stops at a stable state, when no cell changes
its output state any more. The order of a CNN is determined by the number of
variables representing the internal (or output) cell states. A stable state of a
first order CNN is considered as a pattern in the form of the set of output cell
states. In this work the simulative properties of first order homogeneous CNN’s
for pattern formation are investigated. A method for learning them, given an
example of a stable state, is suggested.

* Supported by RFBR, grants 00-01-00026, 01-01-06261 
2 Formal Model Presentation

Notions in this paper are based on those used in [4]. We suppose that a 2D
CNN consists of N cells which are enumerated in some way, i.e. each cell has a
unique name. Connection structure in a CNN is characterized by a connection
template T, which for each cell i consists of a set of its neighbor names, i.e.
$T(i) = \{j_0, \ldots, j_q\}$ ($j_0 = i$), where q is the cardinality of the cell neighborhood.
A real number $a_k$ denotes the weight value of the connection between a cell
named i and its neighbor $j_k \in T(i)$; the set of all neighbor weights is referred to
as a weight template $A = \{p, a_1, \ldots, a_q\}$, where $p = a_0$ is a self-connection weight.

For each cell the set of its neighbors' states forms a neighborhood state
$X_i = \{x_0, \ldots, x_q\}$, where $x_k$ is the state of the neighbor $j_k$ of the cell i. The output
state $y_i$ of the cell i is a non-linear function of $x_i$, i.e. $y_i = f(x_i)$. Here we use
the following piece-wise linear function

$$f(x) = \frac{1}{2}\left(|x+1| - |x-1|\right). \qquad (1)$$

A cell i with $-1 < x_i < 1$ is called a linear cell, otherwise it is called a saturated
cell. Further, $Y_i = f(X_i)$ denotes the neighborhood output state of the cell i. With
the above notations a weighted sum of neighbors' output states can be written as
follows: $A \times Y_i = \sum_{j \in T(i)} a_j y_j$. Depending on this sum the cell changes its state $x_i$
in time in a discrete-time CNN according to the following synchronous updating
rule

$$x_i(t+1) = x_i(t) + \tau\left(-x_i(t) + A \times Y_i(t)\right), \qquad (2)$$

where $\tau$ is a time discretization parameter. Computation either lasts endlessly,
or it stops at a stable state (in this case we say that the CNN forms a pattern, which
is the set of output states of all cells $\{y_i, i = 1, \ldots, N\}$) when no cell output state
is changed in time any more.
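As an illustration of the output function (1) and the updating rule (2), the following Python sketch performs one synchronous step on a rectangular cell array. The representation of the weight template as a dictionary from neighbor offsets to weights and the wrap-around (torus) treatment of the array border are assumptions made for the example; they are not prescribed by the model above.

```python
def f(x):
    """Piece-wise linear output function (1)."""
    return 0.5 * (abs(x + 1) - abs(x - 1))


def step(x, template, tau):
    """One synchronous update (2) of all cell states.

    x        -- list of rows holding the cell states
    template -- dict mapping an offset (di, dj) to a weight; the offset
                (0, 0) carries the self-connection weight p
    tau      -- time discretization parameter
    """
    n, m = len(x), len(x[0])
    y = [[f(v) for v in row] for row in x]  # output states
    new = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            # Weighted sum of neighbor outputs (torus border: assumption).
            s = sum(w * y[(i + di) % n][(j + dj) % m]
                    for (di, dj), w in template.items())
            new[i][j] = x[i][j] + tau * (-x[i][j] + s)
    return new
```

All new states are computed from the old outputs before any cell is overwritten, which is what makes the update synchronous. For a single cell with p = 2 and state 2, the weighted sum equals the state itself, so the state is a fixed point of (2).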

In this paper we consider only space invariant templates, i.e. symmetric ones,
because CNN’s with such templates are known to be stable [5]. A given state
is stable if and only if for all cells the following set of linear equalities and
inequalities holds [3]:

$$\begin{cases} y_i \sum_{j \in T(i)} a_j y_j > 0 & \text{if } |x_i| \geq 1, \\ \sum_{j \in T(i)} a_j y_j = x_i & \text{if } |x_i| < 1. \end{cases} \qquad (3)$$

3 Properties of Patterns Formed by 2D CNN 

The main goal of the CNN investigation in this paper is to describe the possible
stable states in relation to connection template properties. Some results for 2D
CNN were obtained in [3], but they mostly concern mosaic patterns, in which the
output state of each cell belongs to the set $\{-1, 0, +1\}$.
Here we are not restricted to mosaic states and allow cell outputs to be in
$[-1, 1]$. Before presenting the new results, let us say a few words about the formed
patterns. Here we have in mind that a pattern is characterized by areas
made up of pictures repeated more or less systematically, called motives in [3]
(dense black, chessboards, stripes etc.). There exists a boundary (possibly with
zero width) between neighbouring areas with different motives. So, each pattern
is characterized by a set of inherent properties such as a set of motives and
boundary parameters: width and maximal curvature. Consequently, each CNN
with the weight template A can be characterized by the properties of all possible
patterns formed.

First, let us look at the following problem: does a one-to-one
correspondence exist between the weight template and the properties of possible
stable states? Let a CNN have the weight template A, and let C be its stable state.
From (3) it follows that there is an infinite set of weight templates which provide
the stability of a given pattern C. It can be shown in the following way. Let A be
the template which satisfies the conditions (3) for the pattern C. If we multiply
all elements of A by a constant $b > 0$, then these conditions may not hold for
all cells. In particular, for a linear cell i with the state $c_i$ and the neighbourhood
state $C_i$:

$$b(A \times C_i) = b c_i \neq c_i, \quad \text{when } b \neq 1. \qquad (4)$$

In order to correct (4) it is enough to add the value $(1 - b)$ to the self-connection
weight in $bA$ (the obtained template is further denoted as $A(b)$):

$$A(b) \times C_i = b c_i + (1 - b) c_i = c_i. \qquad (5)$$

It is easy to show that the template $A(b)$, $b > 0$, satisfies (3) for all saturated cells.
Consequently we have an infinite set of weight templates (an equivalence class), each
of which can form the given pattern C. Moreover, the equivalence class $A(b)$ of
weight templates is described by the following system of parametric equations:

$$s(b) = s b, \qquad p(b) = 1 + (p - 1) b, \qquad b > 0, \qquad (6)$$

where p and s are the parameters of the weight template A. From (6) it follows that
the self-connection weight value in $A(b)$ is calculated by the following formula:
$p(b) = 1 + (p - 1) b$. Since $b > 0$, if the value p in A is greater than 1 then
$p(b)$ can take any value above 1 and, vice versa, if $p < 1$ then $p(b) < 1$.
Consequently, the set of all equivalence classes of weight templates consists of
three disjoint subsets: 1) with $p < 1$; 2) with $p > 1$; and 3) with $p = 1$.
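The construction of the equivalence class can be checked numerically. The Python sketch below applies the parametric equations (6) to a template and verifies that a linear cell keeps its fixed point, as stated by (5). The function names and the numeric values are chosen only for illustration.

```python
def scale_template(p, neighbours, b):
    """Return the template A(b) given by (6): p(b) = 1 + (p - 1) b and
    every neighbor weight multiplied by b."""
    if b <= 0:
        raise ValueError("b must be positive")
    return 1 + (p - 1) * b, [a * b for a in neighbours]


def weighted_sum(p, neighbours, c_self, c_neigh):
    """Weighted sum of a cell state and its neighbor states."""
    return p * c_self + sum(a * c for a, c in zip(neighbours, c_neigh))


# A linear cell in a stable state: its state equals the weighted sum.
p, neighbours = 0.5, [0.25, 0.25]
c_self, c_neigh = 0.5, [0.5, 0.5]

# The same cell state is reproduced by the template A(b) with b = 4.
p_b, neighbours_b = scale_template(p, neighbours, 4.0)
```

Here p = 0.5 < 1 and p(b) = -1 < 1, so the scaled template stays in the same subset of equivalence classes, in agreement with the remark following (6).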

This result is useful for the investigation of stable states in homogeneous CNN
because it reduces the number of independent weight parameters. Moreover, we
can fix the self-connection weight during the learning process in order to obtain
a concrete weight template from the equivalence class. So it is possible to use
methods based on the Perceptron Learning Rule [7] for CNN learning.

For a weight template with not more than three independent weight parameters
we should investigate the equivalence classes described by not more than two
independent parameters. This can be done by building 2D diagrams [6], any
restricted area of which visually presents the properties of patterns which might
be formed by a CNN with the weight template parameters equal to the coordinates
of the diagram.

4 Method of CNN Learning 

Let the pattern C be given; the size and weight parameters of template A should
be found such that the stability conditions (3) are satisfied. Based on this
problem statement a learning method has been elaborated [6] meeting the following
conditions: 1) it should be local; 2) it should guarantee the individual stability of
prototypes (patterns which are to be stored); 3) the number of prototypes should
be as large as possible. This method is based on the Perceptron Learning Rule [7], and
extensive simulation showed that the proposed method makes it possible to find the
parameters of a weight template (one of the equivalence class) if an example of a
stable state is given.

5 Conclusion 

In this paper the pattern formation properties of homogeneous 2D CNN are
investigated. It is shown that all weight templates are divided into equivalence
classes with respect to the properties of possible stable states. A method of CNN
learning is suggested, which makes it possible to find the parameters of the weight
template if an example of a stable state is given.
Abstract. This paper is a short and informal introduction to failure
detector oracles for asynchronous distributed systems prone to process
crashes and fair lossy channels. A distributed coordination problem (the
implementation of Uniform Reliable Broadcast with a quiescent protocol)
is used as a paradigm to visit two types of such oracles. One of
them is a “guessing” oracle in the sense that it provides a process with
information that the processes could only approximate if they had to
compute it. The other is a “hiding” oracle in the sense that it makes it
possible to isolate and encapsulate the part of a protocol that does not have
the required behavioral properties. A quiescent uniform reliable broadcast
protocol is described. The guessing oracle is used to ensure the “uniformity”
requirement stated in the problem specification. The hiding oracle is used to
ensure the additional “quiescence” property that the protocol behavior
has to satisfy.

Keywords: Asynchronous Distributed Systems, Failure Detectors, Fair 
Lossy Channels, Fault-Tolerance, Oracles, Process Crashes, Quiescent 
Protocol, Uniform Reliable Broadcast. 



1 Introduction 

One of the most striking and disturbing facts of the fault-tolerant asynchronous
distributed computing field is the number of impossibility results that have been
stated and proved in the past years [20, 21]. One of the most outstanding of those
results is related to the Consensus problem. This problem is defined as follows:
each process proposes a value and the processes that do not crash have to agree
(termination) on the same value which has to be one of the proposed values
(safety). It has been shown by Fischer, Lynch and Paterson that this apparently
simple problem actually has no deterministic solution as soon as even only one
process can crash [14]. This is the famous FLP impossibility result. On the other
hand, it is also important to note that a characterization of the problems that
can be solved in the presence of at most one process crash has also been proposed
[8].

When a problem cannot be solved in a given model (representing a particular 
context) several attitudes are possible. One consists in modifying the problem 
statement in order to get solutions to a close (but “modified”) problem. For the 
consensus problem, this has consisted in weakening some of its properties. For 
example, weakening the termination property has given rise to probabilistic
protocols [7]. Other studies have considered the weakening of the agreement
property (e.g., ε-agreement [12], and k-set agreement [11]). Another attitude
consists in enriching the underlying fault-prone asynchronous distributed sys- 
tem with appropriate oracles in order that the problem becomes solvable in the 
augmented system. 

The oracle notion has first been introduced as a language whose words can be 
recognized in one step from a particular state of a Turing machine [15,19]. The 
main characteristic of such oracles is to hide a sequence of computation steps in a 
single step, or to guess the result of a non-computable function. They have been 
used to define equivalence classes of problems and hierarchies of problems when 
they are considered with respect to the assumptions they require to be solved. 
In our case, the oracle notion is related to the detection of failures. These oracles 
do not change the pattern of failures that affect the execution in which they are 
used. Their main characteristic is not related to the number of computation steps 
they hide, but to the guess they provide about failures. Such oracles have been 
proposed and investigated in the past years. Following their designers (mainly S. 
Toueg) they are usually called failure detectors [4,2,9]. A given failure detector 
oracle is related to a problem (or a class of related problems). Of course, it has
to be strong enough to allow the concerned problem to be solved but, maybe more
importantly, it has to be as weak as possible in order to fix the “failure detector”
borderline beyond which the problem cannot be solved.

When we consider the consensus problem, several failure detector classes have 
been defined to solve it [9]. It has also been shown that one of these classes is 
the weakest that can be used to solve consensus [10]. A failure detector belongs 
to this class if it satisfies the following two properties. Completeness: Eventu- 
ally, every process that crashes is suspected by every correct process. Eventual 
Weak Accuracy: Eventually, there is a correct process that is not suspected by 
the correct processes. As we can see, the completeness is on the actual detection 
of crashes, while the accuracy limits the mistakes a failure detector can make. 
Several consensus protocols based on this weakest failure detector oracle have 
been designed [9,25]. It is important to note that a failure detector satisfying 
the previous properties cannot be implemented in an asynchronous distributed 
system prone to process crashes (if it were, it would contradict the FLP impossibility
result!). However, a failure detector that does its best to approximate these
properties can be built. When the behavior of the underlying system allows it 
to satisfy the completeness and the eventual accuracy properties during long 
enough time, the current execution of the consensus protocol can terminate, and 
consequently the current instance of the consensus problem can be solved. 

This paper is an introductory visit to failure detector oracles for asynchronous 
distributed systems where processes can fail by crashing and links can fail by 
dropping messages. To do this visit, we consider a distributed computing problem
related to distributed coordination, namely the Uniform Reliable Broadcast
(URB) problem [16]. This is an important problem as it constitutes a basic
distributed computing building block. Informally, URB is defined by two primitives
(Broadcast and Deliver), such that (1) if a process delivers a message m then all
processes that do not crash eventually deliver m, and (2) each process that does 
not crash eventually delivers at least the messages it broadcasts. By interpreting 
the pair Broadcast/Deliver as This_is_an_order/Execute_it, it is easy to see that 
URB abstracts a family of distributed coordination problems [4,17]. 

Furthermore, in order to fully benefit from the visit, we are interested in 
solving the URB problem with a quiescent protocol. This means that, for each 
application message m that is broadcast by a process, the protocol eventually 
stops sending protocol messages. This is a very important property: it guarantees 
that the network load generated by the calls to the Broadcast primitive remains 
finite despite process and link failures.

The paper is made up of seven sections. Section 2 introduces the underlying
system layer and Section 3 recalls a few results related to the net effect
of process and link failures. Then, Section 4 defines the URB problem. Section 5
presents a “guessing” and a “hiding” failure detector oracle (which were
introduced for the first time in [4] and [2], respectively). These oracles are
then used in Section 6 as underlying building blocks to define a quiescent URB 
protocol. Section 7 concludes the paper. 

2 Asynchronous Distributed System Model 

The system model consists of a finite set of processes, namely, $\Pi = \{p_1, \ldots, p_n\}$.
They communicate and synchronize by sending and receiving messages through
channels. Every pair of processes $p_i$ and $p_j$ is connected by a channel which is
denoted $(p_i, p_j)$.

2.1 Processes with Crash Failures 

A process can fail by crashing, i.e., by prematurely halting. A crashed process
does not recover. A process behaves correctly (i.e., according to its specification)
until it (possibly) crashes. By definition, a correct process is a process that never
crashes. A faulty process is a process that is not correct. In the following, $f$
denotes the maximum number of processes that may be faulty ($f < n - 1$).
There is no assumption on the relative speed of processes.

2.2 Fair Lossy Channels 

In addition to process crashes, we consider that channels can fail by dropping 
messages. Nevertheless, they are assumed to be fair lossy. This means that for 
each channel [pi,pj) we have the following properties: 

— FLC-Fairness (Termination): If pi sends a message m to pj an infinite number 
of times and pj is correct, then eventually pj receives m. 

- FLC-Validity: If Pj receives a message m from pi, then pi previously sent m 
to Pj. 
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— FLC-Integrity: If pj receives a message m infinitely often from pi, then pi 
sends m infinitely often to pj . 

It is important to note that (1) there is no a priori assumption on the message 
transfer delays, and (2) a message can be duplicated a finite number of times. 
The basic communication primitives used by a process pi are: send () to pj, and 
receive () from pj. 
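
The fair lossy behavior can be illustrated with a small simulation. The sketch below is only a toy model under stated assumptions: the class name FairLossyChannel, the helper send_until_received, and the loss rule "only every keep_every-th copy of a message gets through" are hypothetical and chosen to make the run deterministic. A real fair lossy channel may lose messages in any pattern compatible with FLC-Fairness.

```python
from collections import defaultdict


class FairLossyChannel:
    """Toy model of a fair lossy channel (pi, pj).

    Hypothetical loss rule: for each distinct message, only every
    `keep_every`-th copy sent is received; all other copies are dropped.
    """

    def __init__(self, keep_every=3):
        self.keep_every = keep_every
        self.sent = defaultdict(int)   # number of copies of each message sent
        self.received = []             # messages handed to the receiver

    def send(self, m):
        self.sent[m] += 1
        if self.sent[m] % self.keep_every == 0:
            self.received.append(m)    # this copy gets through


def send_until_received(channel, m, max_tries=100):
    """Retransmit m until the receiver has it; return the number of sends."""
    for tries in range(1, max_tries + 1):
        channel.send(m)
        if m in channel.received:
            return tries
    raise RuntimeError("message not received within max_tries")
```

With this loss rule, a sender that keeps retransmitting is eventually heard (FLC-Fairness), and nothing is received that was not sent (FLC-Validity).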

3 A Few Results 

3.1 The Case of a Single Channel 

When we consider a system as simple as one made up of two processes connected 
by a channel, there are some impossibility results related to the effect of process 
crashes, channel unreliability, or the constraint to use only bounded sequence 
numbers (see [21], Chapter 22, for an in-depth presentation of these results). Let
a reliable channel c_rel be a channel such that there is no loss, no duplication,
no creation, and no reordering. Let us consider two processors connected by a
channel c. The aim is to design on top of c a protocol offering a reliable channel
c_rel.



- Let us assume that c is reliable, and that each processor can crash and recover
but has no access to non-volatile memory. There is no protocol that builds a
reliable channel c_rel and that tolerates the crash/recovery of the processors
[13]. To tolerate it, non-volatile memory is necessary so that the
processor state can survive crashes.

- Let us assume that the processors cannot crash, and the underlying channel
c can duplicate or reorder messages (but it does not create or lose messages).
Moreover, only bounded sequence numbers are allowed. It is impossible to
design a protocol that implements a reliable channel c_rel on top of c [30].

- Let us assume that the underlying channel c can lose and reorder messages
but cannot duplicate them. Moreover, the processors do not crash, and only
bounded sequence numbers are allowed. There is a protocol that builds c_rel
on top of c, but this protocol is highly inefficient [1].



3.2 Simulation of Reliable Channels in Presence of Process Crashes 

The effect of lossy channels on the solvability of problems in general is discussed 
in [6]. Two main results are stated. 

- The first concerns a specific class of problems, namely those whose
specification does not refer to faulty processes. This is the class of
correct-restricted problems. An algorithm is provided that transforms any
protocol solving a correct-restricted problem and working with process crashes
and reliable channels into a protocol working with process crashes and fair
lossy links.

- The second result is more general in the sense that it does not consider a
particular class of problems. It presents a protocol that, given a system with
fair lossy channels and a majority of correct processes, simulates a system
with reliable channels. Informally, this shows that a majority of correct
processes is powerful enough to cope with message losses when channels are
fair.

The two proposed transformations do not provide quiescent protocols. 



4 Uniform Reliable Broadcast 

4.1 Definition 

The Uniform Reliable Broadcast problem (URB) is defined in terms of two 
communication primitives: Broadcast() and Deliver(). When a process issues 
Broadcast(m), we say that it “broadcasts” m. Similarly, when a process issues 
Deliver(m), we say that it “delivers” m. Every broadcast message is unique¹. This
means that if an application process invokes Broadcast(m1) and Broadcast(m2)
with m1 and m2 having the same content, m1 and m2 are considered as two
different messages by the underlying layer.
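
One way to obtain this uniqueness is to tag each application message with its sender identity and a per-sender sequence number. The following is a minimal sketch under that assumption; the names AppMessage and Broadcaster are hypothetical and are not part of the protocol text.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class AppMessage:
    sender: int      # identity of the broadcasting process
    seq: int         # per-sender sequence number
    content: str     # application payload


class Broadcaster:
    """Tags each broadcast with (sender, seq) so equal contents stay distinct."""

    def __init__(self, sender):
        self.sender = sender
        self._next_seq = itertools.count(1)

    def make_message(self, content):
        return AppMessage(self.sender, next(self._next_seq), content)
```

Two invocations with the same content then yield two different messages, because their sequence numbers differ.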

Uniform Reliable Broadcast is formally defined by the following set of prop- 
erties [16]: 

- URB-Termination: If a correct process broadcasts m, then any correct process
delivers m (no messages from correct processes are lost). 

— URB-Validity: If a process delivers m, then m has been broadcast by some 
process (no spurious message). 

— URB-Integrity: A process delivers a message m at most once (no duplication). 

— URB-Agreement: If a (correct or not) process delivers m, then any correct 
process delivers m (no message UR_delivered by a process is missed by a 
correct process). 

The last property is sometimes called “Uniform Agreement”. Its non-uniform
counterpart would be: “If a correct process delivers m, then any correct process
delivers m”. The uniformity requirement makes it necessary to also consider the
messages delivered by faulty processes. The Reliable Broadcast problem is similar
to URB except that its Agreement property is non-uniform.

Let us remark that, unlike the other properties, the URB-Termination
property does not apply to faulty processes. This means that the
correct processes deliver the same set of messages S, and that the set of messages
delivered by a faulty process is always a subset of S.
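
The four properties can be made concrete with a small checker over a finished run. This is a sketch under simplifying assumptions: the function name check_urb and its input format are hypothetical, and "eventually" is interpreted as "by the end of the recorded run".

```python
def check_urb(broadcasts, delivered, correct):
    """Return the names of the URB properties violated by a finished run.

    broadcasts: dict process -> list of messages it broadcast.
    delivered:  dict process -> list of messages it delivered (every process
                of the run must appear as a key).
    correct:    set of processes that never crash.
    """
    violated = set()
    all_broadcast = {m for ms in broadcasts.values() for m in ms}
    for p in correct:
        for m in broadcasts.get(p, []):
            if any(m not in delivered[q] for q in correct):
                violated.add("URB-Termination")
    for p, ms in delivered.items():
        if any(m not in all_broadcast for m in ms):
            violated.add("URB-Validity")
        if len(ms) != len(set(ms)):
            violated.add("URB-Integrity")
        # Uniform: messages delivered by faulty processes count as well.
        if any(m not in delivered[q] for m in ms for q in correct):
            violated.add("URB-Agreement")
    return violated
```

A run in which a faulty process delivers a message that no correct process delivers is flagged as violating URB-Agreement, although it would be acceptable for (non-uniform) Reliable Broadcast.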



¹ This can easily be realized, at the underlying level, by associating with each
application message m a pair made up of its sender identity and a sequence number.

4.2 URB with Reliable Channels

Figure 1 describes a simple quiescent protocol (defined in [16]) that solves the
URB problem in asynchronous distributed systems made up of processes that
(1) can crash, and (2) are fully connected by reliable channels (no loss, no
duplication, and no creation of messages). To broadcast a message m, a process
pi sends it to itself. Then, when a process receives a message m for the first
time, it forwards it to the other processes before delivering it. Consequently,
due to channel reliability, the four URB properties are satisfied.



(1) Procedure Broadcast(m):
(2)    send msg(m) to pi

(3) when msg(m) received from pk:
(4)    if (first reception of m) then
(5)       ∀j ≠ i, k do send msg(m) to pj enddo;
(6)       Deliver(m)
(7)    endif



Fig. 1. A quiescent URB protocol for reliable channels 
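
The behavior of the protocol of Figure 1 can be explored with a small simulation. This is a sketch under simplifying assumptions, not a definitive implementation: the function name run_urb_reliable is hypothetical, reliable channels are modeled by a single FIFO queue, and faulty processes are assumed to be crashed from the start.

```python
from collections import deque


def run_urb_reliable(n, broadcaster, m, crashed=frozenset()):
    """Simulate the protocol of Fig. 1 for one message m.

    Processes are numbered 0..n-1; processes in `crashed` never take a step.
    Returns (delivered, sent): the messages delivered by each process and
    the total number of msg(m) protocol messages sent.
    """
    seen = [set() for _ in range(n)]         # messages already received
    delivered = [[] for _ in range(n)]
    network = deque([(broadcaster, broadcaster, m)])   # line 2: send to itself
    sent = 1
    while network:
        dst, src, msg = network.popleft()    # reliable queue: nothing is lost
        if dst in crashed or msg in seen[dst]:
            continue                         # crashed, or not a first reception
        seen[dst].add(msg)
        for j in range(n):                   # line 5: forward before delivering
            if j not in (dst, src):
                network.append((j, dst, msg))
                sent += 1
        delivered[dst].append(msg)           # line 6
    return delivered, sent
```

The loop terminates because each process forwards a given message at most once, which is the quiescence argument in miniature.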



5 Enriching the System with Appropriate Oracles 

A main difficulty in solving the URB problem in presence of fair lossy links lies
in ensuring the URB-Agreement property which states: “If a process delivers
a message m, then any correct process delivers m”. This means that a process
can deliver a message only when it is sure that this message will eventually be
received by each correct process. It has been shown that failure detector oracles
are required to overcome this problem [4,17]. The failure detector (called Θ)
described below is an answer to this problem. It has been introduced in [4].

Although Θ is the weakest failure detector that can be used to ensure the
URB-Agreement property [4], using it alone is not sufficient to get a quiescent
protocol: the broadcast of an application message can still generate an infinite
number of network messages. Actually, ensuring the quiescence property requires
that a process pi be able to know if another process pj is still alive: if pj is not,
pi can stop sending messages to pj even if the last message it sent to it has
not yet been acknowledged. Several failure detectors can be designed to allow a
process pi to get this information. Some (such as Θ) provide outputs with bounded
size. Others provide outputs whose size is not bounded. It has been shown that
the failure detector oracles of the first category cannot be implemented [9], while
some of the second category can be. Hence, in the following we present an oracle
of the second category called Heartbeat that can be implemented. This oracle
has been introduced in [2].

5.1 A Guessing Failure Detector Oracle: Θ

This failure detector [4] is defined by the following properties. Each process pi
is endowed with a local variable TRUSTED_i whose aim is to contain identities
of processes that are currently perceived as non-crashed by pi (this variable is
updated by Θ and read by pi). The failure detector Θ ensures that these variables
satisfy the following properties:

- Θ-Completeness: There is a time after which, for any process pi, TRUSTED_i
does not include faulty processes.

- Θ-Accuracy: At every time, for any process pi, TRUSTED_i includes at least
one correct process. (Note that the correct process trusted by pi is allowed
to change over time.)

In the general case (f < n), the Θ oracle cannot be implemented in an
asynchronous distributed system. That is why we place it in the family of “guessing”
failure detector oracles. By contrast, when the system satisfies the additional
assumption f < n/2, it can be implemented (such an implementation is described
in [4]).

5.2 A Hiding Failure Detector Oracle: Heartbeat 

The Heartbeat failure detector oracle [2] provides each process pi with an array
of counters HB_i[1..n] (initialized to [0, . . . , 0]) such that:

- HB-Completeness: For each process pi, HB_i[j] stops increasing if pj is faulty.

- HB-Accuracy: HB_i[j] never decreases, and HB_i[j] never stops increasing if pi
and pj are correct.

A Heartbeat failure detector can be easily implemented, e.g., by requiring each
process to periodically send “I am alive” messages. This implementation entails
the sending of an infinite number of messages by each correct process: it is not
quiescent. That is the reason why we place it in the family of “hiding” failure
detector oracles. A set of modules (one per process) realizing a Heartbeat oracle
can be used to encapsulate and isolate the non-quiescent part of a protocol and
thereby hide its undesirable behavior.
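
The counter-based interface of such a module can be sketched as follows. This is an illustration only: the names HeartbeatOracle and run_heartbeats are hypothetical, time is modeled as lock-step rounds, and "I am alive" messages are assumed to arrive without loss, which a real implementation over fair lossy channels cannot assume.

```python
class HeartbeatOracle:
    """Toy heartbeat module for one process in a system of n processes.

    Assumption: on_alive(j) is called whenever an "I am alive" message from
    process j arrives; a crashed process stops sending, so its counter stops.
    """

    def __init__(self, n):
        self.hb = [0] * n            # the array of counters, initialized to 0

    def on_alive(self, j):
        self.hb[j] += 1              # counters never decrease

    def snapshot(self):
        return list(self.hb)         # a copy, safe to compare with later reads


def run_heartbeats(n, rounds, crash_round):
    """Each round, every process that has not crashed sends one heartbeat.

    crash_round[j] is the round at which j crashes (None if j is correct).
    """
    oracle = HeartbeatOracle(n)
    for r in range(rounds):
        for j in range(n):
            if crash_round[j] is None or r < crash_round[j]:
                oracle.on_alive(j)
    return oracle.snapshot()
```

The counter of a crashed process freezes while the counters of correct processes keep growing, which is exactly the information a protocol needs in order to stop retransmitting to a crashed destination.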

6 A Protocol 

6.1 Description of the Protocol 

A quiescent URB protocol is described in Figure 2 for a process pi. It is based
on the previously described failure detectors and the classical acknowledgement
mechanism. An important local data structure managed by a process pi is
rec_by_i[m], which records the processes that, to pi’s knowledge, have received a
copy of the application message m. The protocol uses two types of messages,
tagged “msg” and “ack”, respectively. They are called protocol messages, to
distinguish them from the messages broadcast by the application. Each protocol
message tagged “msg” carries an application message, while one tagged “ack”
carries only the identity of an application message.

To broadcast an application message m, a process pi sends a protocol message
tagged “msg” and including m to itself (line 2). When it receives a protocol
message carrying an application message m for the first time (line 12), a process
pi activates the task Diffuse(m), which repeatedly (lines 5-10) sends m to the
processes that, from pi’s point of view, have no copy of m and are alive. The
Heartbeat failure detector is used by pi to know which processes are locally
perceived as being alive. It is important to note that, as soon as the test at line
7 remains permanently false for all j, pi stops sending messages (but does
not necessarily terminate, as it can keep on executing the loop if the condition of
line 10 remains false²). Each time pi receives a protocol message tagged “msg”,
it sends back an “ack” message to inform the sender that it has got a copy of m
(line 16). When a process receives an “ack” message, it updates the
local data rec_by_i[m] accordingly (line 18).

Finally, if pi has not yet delivered an application message m, it does so as
soon as it knows that at least one correct process got it (condition TRUSTED_i ⊆
rec_by_i[m] at line 19).
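
The protocol of Figure 2 can be exercised with a small round-based simulation. The sketch below rests on several assumptions that are not part of the protocol: the names Process and simulate are hypothetical, faulty processes are crashed from the start, the broadcaster is correct, its initial send to itself is treated as a local call, one shared heartbeat vector stands in for every HB_i, TRUSTED_i is simply the set of correct processes (stronger than Θ requires), and each channel lets through only every keep_every-th copy of a given protocol message.

```python
from collections import defaultdict


class Process:
    """State of process pi for the protocol of Figure 2."""

    def __init__(self, i, n):
        self.i, self.n = i, n
        self.rec_by = {}       # m -> ids that, to pi's knowledge, hold m
        self.prev_hb = {}      # m -> heartbeat vector saved by Diffuse(m)
        self.delivered = []

    def on_msg(self, m, k, send):                  # lines 11-16
        if m not in self.rec_by:                   # first reception of m
            self.rec_by[m] = {self.i, k}           # line 13
            self.prev_hb[m] = [-1] * self.n        # lines 14 and 4
        else:
            self.rec_by[m].add(k)                  # line 15
        send(self.i, k, "ack", m)                  # line 16

    def on_ack(self, m, k):                        # lines 17-18
        self.rec_by[m].add(k)

    def diffuse_step(self, hb, send):              # one pass of lines 6-9
        for m, known in self.rec_by.items():
            for j in range(self.n):
                if j != self.i and self.prev_hb[m][j] < hb[j] and j not in known:
                    send(self.i, j, "msg", m)      # line 8
            self.prev_hb[m] = list(hb)             # line 9

    def try_deliver(self, trusted):                # lines 19-20
        for m, known in self.rec_by.items():
            if m not in self.delivered and trusted <= known:
                self.delivered.append(m)


def simulate(n, broadcaster, m, crashed=frozenset(), keep_every=2, rounds=60):
    """Run the protocol in lock-step rounds; return (processes, send_rounds)."""
    procs = [Process(i, n) for i in range(n)]
    alive = [i for i in range(n) if i not in crashed]
    trusted = set(alive)          # stand-in for TRUSTED_i (assumption)
    hb = [0] * n                  # one shared vector stands in for every HB_i
    copies = defaultdict(int)     # copies of each packet sent on each channel
    inbox, send_rounds = [], []
    rnd = 0

    def send(src, dst, kind, msg):
        send_rounds.append(rnd)   # remember when protocol messages were sent
        copies[src, dst, kind, msg] += 1
        if copies[src, dst, kind, msg] % keep_every == 0:   # others are lost
            inbox.append((dst, src, kind, msg))

    procs[broadcaster].on_msg(m, broadcaster, send)         # lines 1-2
    for rnd in range(1, rounds + 1):
        for j in alive:
            hb[j] += 1                                      # heartbeat of pj
        arrived = list(inbox)
        inbox.clear()
        for dst, src, kind, msg in arrived:
            if dst not in trusted:
                continue                                    # crashed receiver
            if kind == "msg":
                procs[dst].on_msg(msg, src, send)
            else:
                procs[dst].on_ack(msg, src)
        for i in alive:
            procs[i].diffuse_step(hb, send)
            procs[i].try_deliver(trusted)
    return procs, send_rounds
```

In such a run every correct process delivers m exactly once, and the last protocol message is sent after a bounded number of rounds: retransmissions to correct processes stop once acknowledgments arrive, and retransmissions to crashed processes stop because their heartbeat counters no longer increase.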

6.2 Proof 

The proofs that the protocol described in Figure 2 satisfies URB-Integrity (no
duplication of an application message) and URB-Validity (no creation of
application messages) are left to the reader. The proof has the same structure as
the proof given in [4].

Lemma 1. If a correct process starts Diffuse(m), eventually all correct processes
start Diffuse(m).

Proof Let us first observe that if the identity k belongs to rec_by_i[m], this
is because pi received msg(m) or ack(m) from pk and consequently updated
rec_by_i[m] at line 13, 15 or 18, from which we conclude that pk has a copy of m.

Let us consider a correct process pi that starts Diffuse(m). It launches this
task at line 14 when it receives m for the first time. Let pj be a correct process. As
pj is correct, HB_i[j] keeps on increasing and the subcondition
(prev_hb_i[m][j] < cur_hb_i[j]) is infinitely often true. We consider two cases:

- Case (j ∈ rec_by_i[m]). In that case, due to the previous observation, pj has
a copy of m. We conclude from the protocol text that pj started Diffuse(m)
when it received m for the first time.

² It is important not to confuse a quiescent protocol and a terminating protocol.
Ensuring termination requires stronger failure detector oracles, namely, oracles that
make it possible to know exactly which processes have crashed and which have not [18].

(1) Procedure Broadcast(m):
(2)    send msg(m) to pi

(3) Task Diffuse(m):
(4)    prev_hb_i[m] ← [−1, . . . , −1];
(5)    repeat periodically
(6)       cur_hb_i ← HB_i;
(7)       ∀j ≠ i: if ((prev_hb_i[m][j] < cur_hb_i[j]) ∧ (j ∉ rec_by_i[m]))
(8)          then send msg(m) to pj endif;
(9)       prev_hb_i[m] ← cur_hb_i
(10)   until (∀j ∈ [1..n] : (j ∈ rec_by_i[m])) endrepeat

(11) when msg(m) is received from pk:
(12)   if (first reception of m)
(13)      then rec_by_i[m] ← {i, k};
(14)           activate task Diffuse(m)
(15)      else rec_by_i[m] ← rec_by_i[m] ∪ {k} endif;
(16)   send ack(m) to pk

(17) when ack(m) is received from pk:
(18)   rec_by_i[m] ← rec_by_i[m] ∪ {k}

(19) when ((pi has not yet delivered m) ∧ (TRUSTED_i ⊆ rec_by_i[m]))
(20)   do Deliver(m) enddo



Fig. 2. A quiescent uniform reliable broadcast protocol 



- Case (j ∉ rec_by_i[m]). In that case pi keeps on sending copies of m to pj
(at line 8). Due to the FLC-Fairness property of the channel (pi, pj), pj
eventually receives m from pi and, if not yet done, starts Diffuse(m).



Lemma 2. If all correct processes start Diffuse(m), they eventually execute
Deliver(m).

Proof Let us assume that all the correct processes execute Diffuse(m) and let
pi and pj be two correct processes. So, pi sends m to pj until it knows that m
has been received by pj (i.e., until j ∈ rec_by_i[m]). Due to the acknowledgment
mechanism and the FLC-Fairness property of the underlying channels, this
eventually occurs. It follows that, for each correct process pi, rec_by_i[m] eventually
includes all correct processes.

Let us now consider the set TRUSTED_i. Due to the Θ-Completeness property
of the Θ failure detector, TRUSTED_i eventually does not include faulty processes.
It follows that the condition (TRUSTED_i ⊆ rec_by_i[m]) eventually becomes true,
and then pi executes Deliver(m). □ Lemma 2

Theorem 1. URB-Termination. If a correct process executes Broadcast(m),
then all correct processes execute Deliver(m).

Proof If a correct process pi executes Broadcast(m), it sends msg(m) to itself
(line 2) and consequently starts the task Diffuse(m) (lines 12-14). Then, due to
Lemma 1, all correct processes start Diffuse(m), and due to Lemma 2, they all
execute Deliver(m). □ Theorem 1

Theorem 2. URB-Agreement. If a process executes Deliver(m), then all correct
processes execute Deliver(m).

Proof If a (correct or not) process pi executes Deliver(m), then the condition
(TRUSTED_i ⊆ rec_by_i[m]) was satisfied just before it executed it. Due to the
Θ-Accuracy property of the Θ failure detector, TRUSTED_i includes at least one
correct process pj. Hence, j ∈ rec_by_i[m], from which we conclude that there
is at least one correct process that received m (at line 11). As pj is correct, it
started the task Diffuse(m) when it received m for the first time. It then follows
from Lemmas 1 and 2 that each correct process executes Deliver(m). □ Theorem 2

Theorem 3. Quiescence. Each invocation of Broadcast(m) gives rise to a finite
number of protocol messages.

Proof Let us observe that the reception of an “ack” protocol message never en- 
tails the sending of a protocol message. It follows that we only have to show that, 
for any application message m, eventually no process sends protocol messages 
of the form msg(m). 

A msg(m) protocol message is sent at line 8 by the task Diffuse(m). So, we
have to show that any process pi eventually stops executing line 8. This is trivial
if pi crashes. So, let us consider that pi is correct. There are two cases according
to the destination process pj:

- Case 1: pj is faulty. Then, due to HB-Completeness, HB_i[j] eventually stops
increasing, and from then on prev_hb_i[m][j] = cur_hb_i[j] = HB_i[j] is permanently
true, from which we conclude that pi stops sending messages to pj.

- Case 2: pj is correct. In that case the subcondition (prev_hb_i[m][j] < cur_hb_i[j])
is infinitely often true. So, let us consider the second subcondition, namely,
(j ∉ rec_by_i[m]). Let us assume that the subcondition (j ∈ rec_by_i[m]) is never
satisfied. We show a contradiction.

If (j ∈ rec_by_i[m]) is never satisfied, it follows that pi sends an infinite number
of protocol messages msg(m) to pj. Due to the FLC-Fairness property of the
channel (pi, pj), pj eventually receives an infinite number of copies of m. Each
time it receives msg(m), pj sends back ack(m) to pi (line 16). It then follows
that, due to the FLC-Fairness property of the channel (pj, pi), pi receives an ack(m)
protocol message from pj. At the first reception of such a protocol message, pi
includes j in rec_by_i[m] (line 18). Finally, let us note that a process identity is
never removed from rec_by_i[m]. So from then on, the condition (j ∈ rec_by_i[m])
remains permanently true. A contradiction. □ Theorem 3

6.3 Favoring Early Quiescence

The guideline in the design of the protocol described in Figure 2 was simplicity.
It is possible to improve the protocol by allowing early quiescence (in some cases,
this can also reduce the number of protocol messages that are exchanged). To
favor early quiescence, the variable rec_by_i[m] of each process has to be updated
as soon as possible. This can be done in the following way:

- (1) add the current content of rec_by_i[m] to each protocol message sent by a
process pi;

- (2) add to each “ack” message the corresponding application message (instead 
of only its identity); 

- (3) send “ack” messages to all the processes (instead of only the sender of the 
corresponding “msg” message). 

The resulting protocol is described in Figure 3. Its improved behavior is obtained 
at the price of bigger protocol messages. Its proof is similar to the proof of Section 
6.2. The main difference lies in the way it is proved that j ∈ rec_by_i[m] means
that pj has got a copy of m. 
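
The way a receiver merges the rec_by set piggybacked on a protocol message (lines 11-15 of Figure 3) can be sketched as follows. The function name on_protocol_message and the dictionary representation of rec_by_i are hypothetical, chosen only for illustration.

```python
def on_protocol_message(rec_by_i, i, m, piggybacked):
    """Merge the rec_by set carried by msg(m, rec_by) or ack(m, rec_by).

    rec_by_i maps each known application message to a set of process ids.
    Returns True on the first reception of m (when Diffuse(m) would start).
    """
    first = m not in rec_by_i
    if first:
        rec_by_i[m] = {i} | set(piggybacked)      # line 13
    else:
        rec_by_i[m] |= set(piggybacked)           # line 15
    return first
```

Because knowledge about who holds m travels with every protocol message, pi can learn that pj holds m from a third process, without waiting for an acknowledgment from pj itself.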



(1) Procedure Broadcast(m):
(2)    send msg(m, ∅) to pi

(3) Task Diffuse(m):
(4)    prev_hb_i[m] ← [−1, . . . , −1];
(5)    repeat periodically
(6)       cur_hb_i ← HB_i;
(7)       ∀j ≠ i: if ((prev_hb_i[m][j] < cur_hb_i[j]) ∧ (j ∉ rec_by_i[m]))
(8)          then send msg(m, rec_by_i[m]) to pj endif;
(9)       prev_hb_i[m] ← cur_hb_i
(10)   until (∀j ∈ [1..n] : (j ∈ rec_by_i[m])) endrepeat

(11) when type(m, rec_by) is received from pk:
(12)   if (first reception of m)
(13)      then rec_by_i[m] ← {i} ∪ rec_by;
(14)           activate task Diffuse(m)
(15)      else rec_by_i[m] ← rec_by_i[m] ∪ rec_by endif;
(16)   if (((type ≠ ack) ∨ (first reception of m)) ∧ (k ≠ i))
(17)      then ∀j ≠ i do send ack(m, rec_by_i[m]) to pj enddo endif

(18) when ((pi has not yet delivered m) ∧ (TRUSTED_i ⊆ rec_by_i[m]))
(19)   do Deliver(m) enddo



Fig. 3. An improved quiescent URB protocol

6.4 Strong Uniform Reliable Broadcast

The URB-Termination property of the URB problem concerns only the correct
processes. Said another way, a message broadcast by a process that crashes
(either during the broadcast or even later) is not required to be delivered. Its
actual delivery depends on the system behavior. This can be a drawback for
some applications. So, let us define Strong Uniform Reliable Broadcast (S_URB).
We define this communication service as being similar to URB except for the
termination property which is:

— S_URB-Termination: If a process completes the execution of Broadcast(m), 
then a correct process delivers m. (This means that, whatever the future 
behavior of its sender, no message that has been broadcast is lost). 

We conclude from the combination of URB-Agreement and S_URB-Termination 
that each correct process delivers all the messages whose broadcasts have been 
completed. 

The S_URB-Termination property can easily be implemented. When we
consider Figure 2, only a very simple modification of the procedure Broadcast(m) is
required. Namely, the single statement wait (TRUSTED_i ⊆ rec_by_i[m]) has to be
added at the end of this procedure. It ensures that when a broadcast completes,
the corresponding application message is known by at least one correct process
(that will disseminate it in its Diffuse task).



7 Conclusion 

Failure detector oracles are becoming a fundamental issue in the design of fault- 
tolerant distributed applications designed to run on fault-prone distributed sys- 
tems. The aim of this paper was to provide a simple introduction to their phi- 
losophy and to illustrate it with some of them, namely, a “guessing” oracle and 
a “hiding” oracle. 

The design of a quiescent protocol solving the Uniform Reliable Broadcast 
problem has been used as a paradigm to show why failure detector oracles are 
required and how they can be used. The guideline for the design of this protocol 
was simplicity (as we have seen, more efficient protocols can be designed). 

The reader interested in more details on the concept of failure detector ora- 
cles, the problems they can help to solve, and their uses, can consult 
[2,4,9,10,17,18,25,29]. 

As far as the consensus problem is concerned, the randomization approach
has been investigated in [7]. The combined use of random oracles and failure
detector oracles for consensus has been investigated in [3,26]. Failure detectors
appropriate to solve the k-set agreement problem have been investigated
in [26,31]. Recently, a randomization approach to solve this problem has been
proposed in [27].

Very recently, a new condition-based approach has been proposed to solve
the consensus problem [22]. It consists in identifying sets of input vectors for
which it is possible to design a consensus protocol that works despite up to
f faults. Such conditions actually define a strict hierarchy [23]. Moreover, this
approach can be extended to solve more general agreement problems [24].

Abstract. Advances in wireless networking technology and portable computing
devices have led to the emergence of a new computing paradigm known as
mobile computing, along with a number of new applications. As a result, software
applications have to be redesigned to take advantage of this environment while
accommodating the new challenges posed by mobility.

As mobile users wander about, they are bound to encounter a variety of 
different information sources (databases) that are often autonomous and 
heterogeneous in nature. Such a collection of autonomous and heterogeneous 
databases is often known as a multidatabase. The existing multidatabase systems
do not readily support mobile computing. A new class of multidatabase that 
provides access to a large collection of data via a wireless networking 
connection is proposed — a Mobile Data Access System (MDAS). Within the 
scope of MDAS, a new transaction-processing model is proposed that allows 
timely and reliable access to heterogeneous and autonomous data sources while 
coping with the mobility issue. The proposed model extends the existing 
multidatabase system without any adverse effect on the preexisting local and
global users. This is accomplished through the implementation of multi-tiered
mobile transaction proxies that manage the execution of mobile transactions on 
behalf of the mobile user. The proposed transaction-processing model is 
simulated and the results are analyzed. 



1 Introduction 

The mobile computing paradigm has emerged due to advances in wireless networking 
technology and portable computing devices. Mobile computing enables users 
equipped with portable computing devices to access information services through a 
shared infrastructure, regardless of physical location or movement. The mobile 
computing environment is a distributed computing platform with the following
differences: the mobility of users and their access devices, frequent disconnection,
limited bandwidth, and the resource constraints of mobile devices (limited
computational and power resources).

Mobile users now have the ability to send and retrieve emails, receive updates on
stock prices and weather, and obtain driving directions while in motion using cellular
phones, pagers, and PDAs. Wireless transmission media across wide-area
telecommunication networks are also an important element in the technological
infrastructure of E-commerce [21]. The effective development of guided and
wireless-media networks will support the delivery of World Wide Web functionality
over the Internet. Using mobile technologies will enable users to purchase E-commerce goods
and services anywhere and anytime. Naturally, mobile users also desire the same
functionality available to them at a stationary computer on a wired network: to edit
and save changes to documents stored on a file server or to query and update shared
data in private or corporate databases. The focus of this paper is on the latter.

As mobile users wander about, they are bound to encounter a variety of different
information sources (databases) that are often autonomous and heterogeneous in
nature. It would be advantageous if a uniform interface could be presented to the mobile
users, freeing them from the need to have knowledge of the data representation or data
access method employed at different data sources. Organizing a collection of
autonomous databases into a multidatabase is therefore desirable. A multidatabase
integrates pre-existing autonomous and heterogeneous databases to form a global
distributed information-sharing paradigm. To support mobile users, it is necessary to
augment the existing multidatabases with wireless networking capabilities. This
augmented multidatabase is known as a Mobile Data Access System (MDAS) [13].

The MDAS must have the capability of supporting a large number of mobile users. 
It is necessary that the MDAS provide timely and reliable access to shared data. 
Multidatabases have been designed to meet these requirements, albeit within the scope 
of the fixed networking environment. However, these systems have not been designed
to cope with the effects of mobility.

Transactions are the means of access to shared data in databases; this is also the
case in a multidatabase and an MDAS. Transaction management in an MDAS
environment has some inherent problems due to the full autonomy of local nodes over 
the execution of transactions and the limitations imposed by the mobile computing 
environment. In this environment, the global transaction manager (GTM) must be able 
to deal with: i) different local transaction management systems; ii) different local 
concurrency control mechanisms; iii) lack of communication with local nodes, and iv) 
limitations of the mobile computing environment. 

Concurrency control is needed in order to increase throughput and to allow timely 
and reliable access to shared data and must therefore support simultaneous execution 
and interleaving of multiple transactions. In an MDAS environment, the concurrency
control algorithm has to overcome the effects of the local autonomy, in addition to 
constraints imposed by the mobile units. 

As an example, consider a transaction in execution on a stationary computer on a
wired network. The occurrence of a disconnection is often treated as a failure in the
network; thus, when this occurs, the executing transaction is aborted. In a mobile
computing environment, which is characterized by frequent disconnection (users may
choose to disconnect voluntarily, for instance to conserve battery life), disconnection
cannot be treated as a failure in the network.

Transactions issued from mobile clients tend to be long-lived. Thus, transactions
issued by mobile users are exposed to a larger number of disconnections. Another
effect of long-lived transactions is that they could result in low system throughput.
Long-lived transactions are more likely to result in conflicts. Pessimistic locking schemes in
the implementation of concurrency control could result in blocking of concurrently
executing transactions, resulting in deadlocks and aborted transactions. On the other
hand, employment of optimistic concurrency control could result in a high rate of
transaction restarts. Thus, a new transaction model is needed for the MDAS
environment that manages concurrency control and recovery, handles frequent
disconnection, and addresses the issue of long-lived transactions while at the same time
not violating the autonomy of the local data sources.

The goal of this paper is to present such a transaction processing model. The model
is built on the concept of global transactions in a multidatabase based on the Summary
Schemas Model [6]. This work expands our effort reported in [13] by implementing
an additional layer on top of the MDBS that handles mobile transactions,
disconnection, and long-lived transactions.

Section 2 addresses the background material on multidatabase systems and mobile 
computing and the issues that affect the MDAS. Section 3 is a description of the 
MDAS transaction processing model and the necessary protocols. Section 4 presents 
the results and analysis of the simulation model of the proposed model. Finally, 
Section 5 concludes the paper and addresses several future research issues. 



2 Background 

The basis of the MDAS is the mobile computing environment and the multidatabase. 
Thus, this section gives a brief overview of the mobile computing environment, 
multidatabase systems, and the concepts and issues that characterize these 
environments. 



2.1 Mobile Computing Environment 

The mobile computing environment is a collection of mobile hosts (MH) and a fixed
networking system [8], [10], [13]. The fixed networking system consists of a collection
of fixed hosts connected through a wired network. Certain fixed hosts, called base
stations or Mobile Support Stations (MSS), are equipped with wireless communication
capability. Each MSS can communicate with MHs that are within its coverage area (a
cell). MHs can move within a cell or between cells, effectively disconnecting from
one MSS and connecting to another. At any point in time, an MH can be connected to
only one MSS. MHs are portable computers that vary in size, processing power, and
memory. Wireless communication, mobility, and portability are three essential
properties of mobile computing that pose difficulties in the design of applications
[10].



2.2 Multidatabase Systems 

A multidatabase system integrates pre-existing local databases to form a single
integrated global distributed database system. It is a collection of autonomous local
database systems (LDBS), possibly of different types. The integration of the DBMSs
is performed by multiple software sub-systems at the local databases [3], [19]. The
local databases are unaware of the existence of the global database [20]. Local
autonomy is the key requirement in the design of a multidatabase. In a multidatabase
there are two types of users: local users and global users. Local autonomy guarantees
that the local users access their own local database independent of, and unaffected by,
the existence of the multidatabase and its global users. Autonomy in multidatabases
comes in four forms: design autonomy, participation autonomy, communication
autonomy, and execution autonomy [3].



2.3 MDAS Issues 

The MDAS is a multidatabase system that has been augmented to provide support for 
wireless access to shared data. Issues that affect multidatabases are therefore 
applicable to the MDAS. Mobile computing raises additional issues over and above 
those outlined in the design of a multidatabase. In the following we examine the 
effects of mobility on query processing and optimization, and transaction processing. 

• Query Processing and Optimization: The higher communication cost of the wireless 
medium and the limited power of a mobile unit may lead to query processing and 
optimization algorithms that focus on reducing the financial cost of transactions, 
and to query processing strategies for long-lived transactions that rely on fewer, 
longer communications rather than frequent short ones. Query optimization 
algorithms may also be designed to select plans based on their energy consumption. 
Approximate answers will be more acceptable in mobile databases than in 
traditional databases due to the frequent disconnection and the long latency of 
transaction execution [1]. 

• Transaction Processing: Since disconnection is a common mode of operation, 
transaction processing must provide support for disconnected operation. 
Temporary disconnection should be tolerated with a minimum disruption of 
transaction processing and suspension of transactions on either stationary or 
mobile hosts. In order for users to work effectively during periods of 
disconnection, mobile computers will require a substantial degree of autonomy 
[1],[13],[18]. Effects of mobile transactions committed during disconnection 
should be incorporated into the database while guaranteeing data and transaction 
correctness upon reconnection [18]. Atomic transactions are the normal mode of 
access to shared data in traditional databases. Mobile transactions that access 
shared data cannot be structured using atomic transactions. Instead, mobile 
computations need to be organized as a set of transactions, some of which execute 
on mobile hosts and others on the mobile support hosts. The 
transaction model will need to include aspects of long transaction models and 
Sagas. Mobile transactions are expected to be lengthy due to the mobility of the 
data consumers and/or data producers and their interactive nature. Atomic 
transactions cannot handle partial failures or provide 
different recovery strategies that minimize the effects of failure [1],[7],[20]. 

• Transaction Failure and Recovery: Disconnection, bandwidth limitations, and 
higher probability of damage to the mobile devices are some of the possible 
sources of failure in mobile environments. Special action can be taken on behalf of 
active transactions at the time a disconnection is predicted: a transaction 
process may be migrated to a stationary computer, particularly if no further user 
interaction is required. Remote data may be downloaded in advance of the 
predicted disconnection in support of interactive transactions that should continue 
to execute locally on the mobile machine after disconnection. Log records needed 
for recovery may be transferred from the mobile host to a stationary host [1]. 



2.4 Summary Schemas Model 

The Summary Schemas Model (SSM) has been proposed in [6] as an efficient means 
to access data in a heterogeneous multidatabase environment. The SSM uses a 
hierarchical meta structure that provides an incrementally concise view of the data in 
the form of summary schemas. The hierarchical data structure of the SSM consists of 
leaf nodes and summary schema nodes. The leaf nodes represent the portion of local 
databases that are globally shared. Each higher-level node (summary schema nodes) 
provides a more concise view of the data by summarizing the semantic contents of its 
children. The terms in the schema are related through synonym, hypernym and 
hyponym links. The SSM allows a user to submit a request in his/her own terms. It 
intelligently resolves a query into a set of subqueries using the semantic contents of 
the SSM meta data. The overall memory requirements for the SSM, compared to the 
requirements of a global schema, are reduced by up to 94%. Consequently, 
the SSM meta data could be kept in main memory, thus reducing the access time and 
query processing time. Furthermore, for resource-scarce MDAS access devices, 
caching the upper levels of the SSM meta data structure allows a great amount of 
autonomy to each mobile unit. Finally, the SSM could be used to browse data by 
“stepping” through the hierarchy, or view semantically similar data through queries. 
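
The hierarchy described above can be illustrated with a minimal Python sketch. It is not the implementation from [6]: the class name SchemaNode, the functions leaves and resolve, and the sample terms and database names are hypothetical, and only hypernym/hyponym links (parent/child) are modelled, so synonym links and any semantic-distance measure are omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SchemaNode:
    term: str                                   # term that summarizes the subtree
    children: List["SchemaNode"] = field(default_factory=list)
    database: Optional[str] = None              # set on leaf nodes only (assumed name)


def leaves(node):
    """Collect the local databases summarized by `node`."""
    if node.database is not None:
        return [node.database]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def resolve(node, term):
    """Return the local databases whose shared data falls under `term`."""
    if node.term == term:
        return leaves(node)          # every leaf below a matching node qualifies
    found = []
    for child in node.children:
        found.extend(resolve(child, term))
    return found


# Hypothetical three-level hierarchy: "asset" summarizes "vehicle" and "building".
root = SchemaNode("asset", [
    SchemaNode("vehicle", [
        SchemaNode("car", database="LDBS1"),
        SchemaNode("truck", database="LDBS2"),
    ]),
    SchemaNode("building", database="LDBS3"),
])

print(resolve(root, "vehicle"))
```

A query phrased with the more general term "vehicle" is resolved to the two local databases that export its hyponyms, which is the sense in which a user may submit a request in his or her own terms.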



3 Proposed Transaction Processing Model 

The proposed MDAS transaction model is based on a multi-tiered approach capable of 
supporting pre-existing global users on the wired network in addition to mobile users. 
The proposed transaction model is implemented as a software module on top of the 
pre-existing multidatabase management system. Integration of mobile computing 
with the pre-existing multidatabase system is then the key challenge in the MDAS. 



3.1 MDAS Transactions 

We may distinguish three types of transactions: 

• Local transactions that access only local data at each LDBS, 

• Global Transactions that access data at more than one LDBS, and 

• Mobile transactions that could access data from more than one LDBS. 

In reality, a mobile transaction is no different from a global transaction as far as 
the MDBS layer is concerned. However, a number of factors make it sufficiently 
different to be considered a separate transaction type in the MDAS: 

• Mobile transactions require the support of stationary hosts for their 
computations and communications. 

• Mobile transactions might have to split their computations, with one part 
executing on a mobile client and the other part executing on a stationary host. 

• Mobile transactions might have to share state and data. This violates 
the ACID assumptions of traditional transaction processing. 

• Mobile transactions might have to be migrated to stationary hosts in order to 
accommodate the disconnection of the mobile client. 

• Mobile transactions tend to be long lived. This is a consequence of the 
frequent disconnection experienced by the mobile client and the mobility of 
the mobile client. 



3.2 MDAS Transaction Model 

The MDAS, as we envision it, consists of a software module, called a Mobile 
Transaction Manager (MTM), implemented above the MDBS layer. The two layers 
combined form the MDAS. The MTM is responsible for managing the submission of 
mobile transactions to the MDBS layer and their execution. Thus, the MTM acts as a 
proxy for the mobile unit, thereby establishing a static presence for the mobile unit on 
the fixed network. The other half, the Global Transaction Manager (GTM), is 
responsible for managing the execution 
of global transactions submitted by non-mobile users and mobile transactions 
submitted on behalf of the mobile unit by the MTM. 

Our approach is based on the principle that the computation and communication 
demands of an algorithm should be satisfied within the static segment of the system to 
the extent possible [2]. In other words, we attempt: i) to localize communication 
between a fixed host and a mobile host within the same cell, ii) to reduce the number 
of wireless messages by offloading most of the communication and computation 
requirements to the fixed segment of the network, and iii) to develop distributed 
algorithms based on the logical structure maintained within the fixed network. 

Mobile transactions are submitted to the MDBS layer in a FIFO order by the 
MTM. There are two operating modes that reflect the level of delegation of authority 
to the proxy by the mobile client. 

• Full Delegation Mode: In this mode the mobile client delegates complete 
authority of the mobile transaction to the MTM. The MTM has the authority to 
commit the transaction upon completion. If there is a conflict the MTM may 
decide to abort the transaction and resubmit it later. In any case, the mobile 
client is notified of the status of the transaction and will receive the results (if any). 

• Partial Delegation Mode: In this mode more participation is required of the 
mobile client. The mobile client has the final say on whether or not the transaction 
should be committed. The MTM submits the transaction to the MDBS and 
manages its execution on behalf of the mobile client. Upon completion of the 
operations of the transaction, the mobile client is notified and the MTM waits for 
the commit or abort message from the mobile client. 
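
The difference between the two modes can be summarized in a short Python sketch. It is only an illustration under stated assumptions: the names Mode and decide are hypothetical, and the rule that an unanswered request in Partial Delegation Mode ends in an abort assumes a timeout on the MTM side.

```python
from enum import Enum


class Mode(Enum):
    FULL = "full delegation"
    PARTIAL = "partial delegation"


def decide(mode, conflict, client_reply=None):
    """Final action of the MTM once the operations of a transaction are done.

    client_reply is "commit", "abort", or None when the mobile client did not
    answer before the (assumed) timeout. It is ignored in Full Delegation Mode.
    """
    if mode is Mode.FULL:
        # The MTM holds complete authority: commit, or abort and resubmit later.
        return "abort-and-resubmit" if conflict else "commit"
    # Partial delegation: the mobile client has the final say.
    if client_reply is None:
        return "abort"
    return client_reply


print(decide(Mode.FULL, conflict=False))
print(decide(Mode.PARTIAL, conflict=False, client_reply=None))
```

In both modes the mobile client would still be notified of the outcome; the sketch only captures who takes the commit decision.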

In applying the proposed transaction-processing model to the MDAS we may 
derive the following benefits: 

• Our protocol decouples the effects of mobility from the MDBS. Hence, any 
developed concurrency control and recovery mechanism can be readily adopted 
into our protocol. 

• The MDBS layer does not need to be aware of the mobile nature of some nodes. 
The mobile transactions are submitted to the MDBS interface by the transaction 
proxies. The MDBS interacts with the transaction proxy as though it were the 
mobile unit. In the case of a mobile transaction, most of the communication is 
within the fixed network and as far as the MDBS is concerned, a static host has 
initiated the transaction. 

• The operations of non-mobile users are unaffected by the transactions of mobile 
users. The effects of long-lived transactions can be effectively and efficiently 
handled. Delegating the authority to commit and/or abort a transaction on behalf of 
the mobile host to the transaction proxy can minimize the effects of long-lived 
transactions. Thus, transactions initiated by non-mobile users will experience less 
conflict and as a consequence system throughput and response times are not 
severely affected. 

• The mobile host may disconnect and freely change location since the transaction 
proxy acts on its behalf without requiring any participation from the mobile host 
unless it is interested in the outcome. 



3.3 Operating Modes 

Mobile Host - MSS Relationship. In the proposed MDAS transaction-processing 
model, communication occurs through the exchange of messages between static 
and/or mobile hosts. To send a message from a mobile host to another host, 
either fixed or mobile, the message is first sent to the local MSS over the wireless 
network. If the destination is a mobile host, the MSS forwards the message to the 
local MSS of that host, which delivers it over the wireless network. Otherwise, the 
message is forwarded directly to the fixed host. The 
location of a mobile host within the network is neither fixed nor universally known in 
the network. Thus, when sending a message to a mobile host the MSS that serves the 
mobile host must first be determined. This is a problem that has been addressed 
through a variety of routing protocols (e.g. Mobile IP, CDPD) at the network layer 
[4,11]. We are not concerned with any particular routing protocol for message 
delivery but instead assume that the network layer addresses this issue. 

Each MSS maintains a list of ids of mobile hosts that are local to its cell. When a 
mobile host enters a new cell, it sends a join message to the new MSS. The join 
message includes the id (usually the IP address) of the mobile host. When the MSS 
receives the join message, it adds the mobile host to its list of local mobile hosts. To 
change location, the mobile host must also send a leave message to the local MSS. 
The mobile host neither sends nor receives any further messages within the present 
cell once the leave message has been sent. When the MSS receives the leave message 
from the mobile host, it removes the mobile host id from its list of local mobile hosts. 

Disconnection is often predictable by a mobile host before it occurs. Therefore, in 
order to disconnect, the mobile host sends a disconnect message to the local MSS. The 
disconnect message is similar to the leave message, the only difference being that 
when a mobile host issues a leave message it is bound to reconnect at some other MSS 
at a later time. A mobile host that has issued a disconnect message may or may not 
reconnect at any MSS later. When the MSS receives the disconnect message, a 
disconnect flag is set for the particular mobile host id. If an attempt is made to locate a 
disconnected mobile host, the initiator of the search will be informed of the 
disconnected status of the mobile host. 

A mobile host that issues a leave message or a disconnect message must issue a 
reconnect message to reconnect to a MSS. The reconnect message must include the 
ids of the mobile host and the previous MSS at which it was last connected. The id of 
the previous MSS is necessary so that the new MSS and the previous MSS can 
execute any handoff procedures necessary, for instance, unsetting the disconnect flag. 
When the MSS receives the reconnect message it adds the mobile host to its list of 
local mobile hosts and executes any handoff procedures with the prior MSS. 
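
The join, leave, disconnect, and reconnect messages can be sketched as operations on the state kept by each MSS. This is a minimal illustration, not the protocol implementation: the class and method names are hypothetical, and it assumes that a disconnected host is dropped from the list of local hosts while its disconnect flag is kept until the handoff.

```python
class MSS:
    """State a Mobile Support Station keeps about mobile hosts (sketch)."""

    def __init__(self, name):
        self.name = name
        self.local = set()          # ids of mobile hosts currently in this cell
        self.disconnected = set()   # ids whose disconnect flag is set

    def join(self, mh_id):
        self.local.add(mh_id)

    def leave(self, mh_id):
        self.local.discard(mh_id)

    def disconnect(self, mh_id):
        # Like leave, but the flag lets a later search report the status.
        self.local.discard(mh_id)
        self.disconnected.add(mh_id)

    def reconnect(self, mh_id, previous):
        # Handoff with the previous MSS: unset its disconnect flag for this host.
        previous.disconnected.discard(mh_id)
        previous.local.discard(mh_id)
        self.local.add(mh_id)

    def status(self, mh_id):
        if mh_id in self.local:
            return "connected"
        if mh_id in self.disconnected:
            return "disconnected"
        return "unknown"


a, b = MSS("A"), MSS("B")
a.join("mh1")
a.disconnect("mh1")
print(a.status("mh1"))
b.reconnect("mh1", previous=a)
print(b.status("mh1"), a.status("mh1"))
```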

Mobile Host - MTM Relationship. To initiate a transaction, the mobile host sends a 
Begin-Transaction message to the MTM. The MTM acknowledges the request by 
returning a transaction sequence number. Each MSS has a MTM associated with it 
and transaction sequence numbers are assigned in a distributed manner among the 
MTMs in the system using any distributed ordering algorithm, for example, Lamport’s 
algorithm [12]. The mobile host tags each transaction request message with a 
transaction id, which is composed of the mobile host id, and the transaction sequence 
number. The transaction request message is composed of the mobile host id, the 
transaction sequence number, and the transaction operations. To signify the 
completion of a transaction request, an End-Transaction message is sent to the MTM. 
Transaction execution is delayed until the receipt of the End-Transaction message. 
This is in order to guarantee that the entire transaction as a whole is submitted to the 
MDBS. 
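
A minimal Python sketch of this exchange follows. It assumes a Lamport-style logical clock at each MTM and, as a further assumption, breaks ties between MTMs by pairing the clock value with a site id; the class name MTM and its method names are hypothetical. Operations are buffered until End-Transaction arrives, so that the transaction reaches the ready list as a whole.

```python
class MTM:
    """Sequence numbering and request buffering at one MTM (sketch)."""

    def __init__(self, site_id):
        self.site_id = site_id
        self.clock = 0       # Lamport logical clock
        self.pending = {}    # transaction id -> operations received so far
        self.ready = []      # complete transactions, in FIFO order

    def observe(self, remote_clock):
        # Lamport rule applied when a message from another MTM arrives.
        self.clock = max(self.clock, remote_clock) + 1

    def begin_transaction(self, mh_id):
        self.clock += 1
        seq = (self.clock, self.site_id)   # site id breaks ties (assumption)
        self.pending[(mh_id, seq)] = []
        return seq

    def request(self, mh_id, seq, operations):
        self.pending[(mh_id, seq)].extend(operations)

    def end_transaction(self, mh_id, seq):
        txn_id = (mh_id, seq)              # mobile host id + sequence number
        self.ready.append((txn_id, self.pending.pop(txn_id)))


m1, m2 = MTM(1), MTM(2)
s1 = m1.begin_transaction("mh7")
m2.observe(m1.clock)                       # m2 learns of m1's clock value
s2 = m2.begin_transaction("mh9")
m1.request("mh7", s1, ["r(x)", "w(y)"])
m1.end_transaction("mh7", s1)
print(s1 < s2, m1.ready)
```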



3.4 Transaction Processing Model Work Flow 

The transaction processing model workflow can be described as shown in Fig. 1. 

• The mobile host initiates a transaction request message. The message is received 
by the MSS, and is forwarded to the associated MTM. 

• The MTM receives the transaction request from the MSS. The transaction request 
is logged and the transaction id (transaction sequence number + mobile host id) is 
placed in the ready list. A transaction proxy is created to execute the transaction. 

• The transaction proxy removes a transaction id from the ready list and inserts it 
into the active list. The transaction proxy translates the transaction request and then 
submits the transaction to the MDBS for execution. 

• The transaction request is executed at the MDBS layer and the results and/or data 
are returned to the transaction proxy. 

• The transaction proxy places the transaction id in the output list along with the 
results and data to be returned to the mobile host. 

• The MTM initiates a search for the location of the mobile host and the results are 
transferred to the mobile host if it is still connected and then the transaction id is 
removed from the ready list. 

Fig. 1. Transaction processing workflow (figure labels: Mobile transaction, 
Completed transaction; MTM: Mobile Transaction Manager; TP: Transaction Proxy; 
MDBS: Multidatabase System) 
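
The movement of a transaction through the ready, active, and output lists can be sketched as follows. The class WorkflowMTM and its method names are hypothetical, the MDBS layer is replaced by a plain callable, and removing a delivered result from the output list is an assumption about the final bookkeeping step.

```python
from collections import deque


class WorkflowMTM:
    """Ready/active/output bookkeeping of the MTM (sketch)."""

    def __init__(self, mdbs):
        self.mdbs = mdbs        # callable standing in for the MDBS layer
        self.log = []
        self.ready = deque()    # FIFO order of submission
        self.active = {}
        self.output = {}

    def receive(self, txn_id, operations):
        # The request is logged and placed in the ready list.
        self.log.append((txn_id, operations))
        self.ready.append((txn_id, operations))

    def run_proxy(self):
        # The proxy moves the transaction from ready to active, executes it,
        # and stores the results in the output list.
        txn_id, operations = self.ready.popleft()
        self.active[txn_id] = operations
        result = self.mdbs(operations)
        del self.active[txn_id]
        self.output[txn_id] = result
        return txn_id

    def deliver(self, txn_id, connected):
        # Results are handed over only if the mobile host is still connected.
        if connected:
            return self.output.pop(txn_id)
        return None


mtm = WorkflowMTM(lambda ops: [op.upper() for op in ops])
mtm.receive(("mh1", 1), ["read x"])
done = mtm.run_proxy()
print(mtm.deliver(done, connected=True))
```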



3.5 Disconnected Operation 

We turn our attention to the case where the mobile host is no longer connected to the 
local MSS while the transaction is still in execution. By handing over transaction 
execution to transaction proxies, the disconnection of a mobile host or its relocation 
does not affect the transaction execution. The key issue to be addressed is how to 
notify the mobile host of the results of the transaction execution. In this case the 
following actions are taken: 

• On reconnection at the new MSS, the mobile host should supply the id of the 
previous MSS to which it was connected. A handoff procedure is then initiated 
between the two MSSs. 

• As part of the handoff procedure, the MTM at the previous MSS searches its ready 
list; if the transaction request issued by the mobile host has not yet been processed, 
it is forwarded to the MTM at the new MSS and inserted into its ready list. Thus, 
control of transaction execution is transferred to the new MSS. 

• If the transaction has completed its execution then the results are forwarded to the 
new MSS, which subsequently returns them to the mobile host. 

• If the transaction is still active then control is not transferred but the new MSS 
places the transaction request in its active list but marks it as being executed at 
another site. The previous MSS will initiate a search for the new MSS of the 
mobile host when the transaction is complete in order to transfer the results to it. 
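
The three handoff cases above can be sketched as a single function. This is an illustrative reading of the rules, not the authors' implementation: the function names, the dictionary layout of the MTM state, and the returned labels are hypothetical.

```python
def handoff(previous, new, txn_id):
    """Apply the handoff rules for one transaction; return (action, payload)."""
    if txn_id in previous["ready"]:
        # Not yet processed: move the request, transferring control.
        new["ready"][txn_id] = previous["ready"].pop(txn_id)
        return ("control transferred", None)
    if txn_id in previous["output"]:
        # Completed: forward the results so the new MSS can return them.
        return ("results forwarded", previous["output"].pop(txn_id))
    if txn_id in previous["active"]:
        # Still executing: the new MSS only records where it is running.
        new["active"][txn_id] = "executing at previous MSS"
        return ("still active", None)
    return ("unknown transaction", None)


def empty_mtm():
    """State of one MTM: ready, active, and output lists keyed by transaction id."""
    return {"ready": {}, "active": {}, "output": {}}


old_mtm, new_mtm = empty_mtm(), empty_mtm()
old_mtm["output"]["t2"] = "result of t2"
print(handoff(old_mtm, new_mtm, "t2"))
```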



4 Simulation Results and Analysis 

4.1 Simulator 

A simulator was developed to measure the feasibility of the proposed protocol within 
an MDAS environment. The MDASsim simulator and some results from a 
comparison between the two modes of operation of the Mobile Transaction Manager 
(Full Delegation Mode and Partial Delegation Mode) are presented. The simulator is 
based on the DBsim simulator presented in [15]. However, DBsim has been extended 
to support the concepts of a multidatabase and the MDAS. 

The DBsim is an event driven simulator written in C++. The DBsim was designed 
as a framework to simulate different scheduling policies. Its architecture is based on 
an object-oriented paradigm, with all the major components implemented as classes in 
C++. The simulator is a collection of cooperating objects comprising the event 
controller, transaction manager (TM), scheduler, data manager (DM), and the 
bookkeeper. A multidatabase is much more complex to model than a 
distributed database, mainly due to the local autonomy and heterogeneity issues. As a 
result, the DBsim was enhanced with additional flexibility to simulate the important 
aspects of the MDAS environment. In order to achieve this we introduced three new 
concepts to the simulation model: 

• The DBsim architecture implemented a single transaction manager. We have 
departed from this design by allowing a transaction manager at each of the local 
nodes. 

• An additional layer above the local transaction managers was implemented to 
manage global and mobile transactions. This is the global transaction manager 
(GTM). For the purpose of our simulation, the GTM serves as the Mobile 
Transaction Manager (MTM) as well. 

• We have introduced the concepts of global and mobile transactions into the 
simulation model. 

For each simulated local node we have one data manager (DM) object, one 
scheduler object and one transaction manager (TM) object. The GTM object is 
responsible for creating mobile and global subtransactions that generate operations 
to the transaction managers at each local node. The architecture of our simulator is 
shown in Fig. 2. 



4.2 Global Transaction Manager 

A Global Transaction is resolved by the summary schema’s meta data. As a result, the 
global transaction is decomposed into several subtransactions, each resolved at a local 
node. This process also identifies a global transaction manager for the global 
transaction: a global transaction manager is the lowest summary schema node that 
semantically contains the information space manipulated by the transaction. In our 
simulated environment, the number of local nodes at which the transaction is resolved 
is chosen randomly from the number of nodes in the system. The global or mobile 
transaction makes calls to the local transaction managers to begin execution of the 
subtransactions. 

Fig. 2. MDASsim architecture 

To model the multiprogramming (MP) level at each local node, the simulator maintains 
a fixed number of local transactions executing simultaneously at each node; this 
number is varied across simulation runs. A fixed number of global/mobile 
transactions is also maintained in the system. The ratio of global to mobile 
transactions is varied across simulation runs as well. Together, the fixed number 
of simultaneous local transactions and the fixed number of global/mobile transactions 
serve as an approximation of a system with a constant load. Every time a transaction 
(local, global or mobile) is terminated, a new one is created after some random delay. 

Upon creation (submission) of a transaction (subtransaction) to a local node, its 
operations (read, write, commit or abort) are scheduled for execution. Every 
time an operation finishes successfully, the transaction, after a short delay, either 
generates a new operation or decides to end the transaction by sending a commit or 
abort operation. 



4.3 Commit Protocol 

Each local node implements the two-phase commit (2PC) protocol. In case of a global 
or mobile subtransaction, the GTM coordinates the commit protocol so as to ensure 
that either all or none of the subtransactions succeed to preserve the atomicity of the 
global transaction. A timeout is used to simulate the obtaining of permission to 
commit the mobile transaction from the mobile unit when there is a need to do so. A 
commit or abort is returned based on the probability of communication between the 
MTM and the mobile unit during the timeout period. 
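
A minimal sketch of how such a commit decision might be simulated is given below. The function names are hypothetical, the default probability of 0.20 is the value listed in Table 3, and the sketch assumes that a mobile unit that is reached within the timeout always grants permission to commit.

```python
import random


def mobile_permission(rng, p_not_found=0.20):
    """Simulate waiting for the mobile unit during the timeout period.

    Returns True if the unit was reached (and, by assumption, agreed to commit).
    """
    return rng.random() >= p_not_found


def global_decision(votes, permission=True):
    """Outcome coordinated by the GTM: all subtransactions commit, or none do."""
    return "commit" if permission and all(votes) else "abort"


rng = random.Random(2001)                   # fixed seed for a repeatable run
votes = [True, True, True]                  # every local node is prepared to commit
print(global_decision(votes))                               # Full Delegation mode
print(global_decision(votes, mobile_permission(rng)))       # Partial Delegation mode
```

In Full Delegation mode the permission step is skipped, which is why the second argument defaults to True.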



4.4 Simulation Parameters 

The behavior of our multi-tiered control level protocol is determined based on several 
parameters. Some of these parameters are hardware, software, and administrative 
dependent, and others are application dependent. It is very important that reasonable 
values be selected for each parameter. The system parameters were derived from the 
underlying platform. Additional parameters for the mobile component of the system 
were obtained from the work reported in [13]. These parameters are presented in 
Tables 1-3. 



Table 1. Min and max values of interval parameters 

Parameter                                              Min.     Max. 
Number of operations in local transactions selected    2        8 
Number of operations generated in a burst              3        5 
Time between transactions                              10ms     100ms 
Time between operation requests                        1ms      10ms 
Time between operations in a burst                     1ms      3ms 
Time to perform a disk operation                       Sms      16ms 
Restart delay                                          500ms    1500ms 



Table 2. Application parameters 

Parameter                                      Value 
Number of transactions                         20000 
Size of address space, # of resource units     20000 
Hot spot size, # of resource units             2000 
Hot spot probability                           50% 
Abort probability                              0.1% 
Read probability                               80% 
Burst probability                              20% 
Block size                                     4KB 
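
As an illustration of how the application parameters might drive operation generation, the following Python sketch draws one operation at a time. It is not taken from DBsim: the function name is hypothetical, and the sketch assumes that the hot spot occupies the first 2000 resource units and that accesses outside the hot spot never fall inside it.

```python
import random

ADDRESS_SPACE = 20000         # resource units (Table 2)
HOT_SPOT_SIZE = 2000          # resource units (Table 2)
HOT_SPOT_PROBABILITY = 0.50   # Table 2
READ_PROBABILITY = 0.80       # Table 2


def next_operation(rng):
    """Draw one (kind, resource unit) pair."""
    if rng.random() < HOT_SPOT_PROBABILITY:
        unit = rng.randrange(HOT_SPOT_SIZE)                   # inside the hot spot
    else:
        unit = rng.randrange(HOT_SPOT_SIZE, ADDRESS_SPACE)    # outside the hot spot
    kind = "read" if rng.random() < READ_PROBABILITY else "write"
    return kind, unit


rng = random.Random(7)        # fixed seed for a repeatable run
for _ in range(3):
    print(next_operation(rng))
```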



Table 3. Global and Mobile Unit Parameters 

Parameter                                                          Default Value 
Number of global/mobile transactions in the system                 10 
Service time per message to the mobile unit (selected randomly)    0.3 - 3 sec 
Probability of mobile unit not being found after the timeout       0.20 



4.5 Simulations and Results 

Our simulations were based on a constant load. The MPL (the number of simultaneous 
local transactions) at the local sites was constant during each simulation run (and 
varied from 1 to 25 across runs), with a mix of global and mobile transactions. At all 
times, the total number of global/mobile transactions is also constant, while the ratio 
of global to mobile transactions varies between simulation runs (chosen as 20%, 40%, 
50%, 60% and 80%). Throughput was used as the performance measure, and it was 
measured against the number of simultaneous local transactions (MP-level), 
the ratio of global to mobile transactions, and the two operating 
modes of the MTM (Full Delegation and Partial Delegation). In general, as one 
could expect, at a lower MP-level the global/mobile throughput was higher, since 
fewer local transactions in the system mean fewer conflicts among 
transactions. As the MP-level increased, the global/mobile throughput dropped as a 
result of more local transactions in the system and the increased likelihood of indirect 
conflicts among global/mobile transactions. 

Figs. 3-5 show the throughput of both global and mobile transactions as the 
number of simultaneous local transactions and the ratio of global to mobile 
transactions are varied. The charts compare the results under the Full Delegation mode 
of operation and the Partial Delegation mode of operation. As can be noted, the 
performance under the Full Delegation mode (FDM) surpasses that of the Partial 
Delegation mode (PDM), since the proxy needs to communicate with the mobile unit 
under the latter scheme. However, such performance degradation is quite tolerable, 
especially when one considers the flexibility and adaptability of our approach. 



5 Conclusion and Future Directions 

5.1 Conclusion 

This paper proposed a new transaction-processing model for the mobile data access 
system (MDAS). The proposed multi-tiered transaction model uses the concepts of 
transaction proxies to manage the execution of mobile transactions. To provide 
support for mobile transactions, a layer, the Mobile Transaction Manager (MTM), is 
implemented above the pre-existing multidatabase system. Using proxies the proposed 
model decouples the effects of mobility - frequent disconnection, limited bandwidth, 
limited computational resources, etc. - from the multidatabase systems. 

Two modes of operation, namely Full Delegation mode and Partial Delegation 
mode, were proposed to address the level of participation of a mobile unit in the 
completion of a mobile transaction. In the Full Delegation mode of operation, the 
mobile unit relinquishes control of the final commit/abort of a transaction to the 
MTM. In the Partial Delegation mode of operation, the mobile unit has the final say 
on whether to commit or abort the transaction. The MTM must communicate with the 
mobile unit when the transaction is ready to be committed. However, should the 
mobile unit be unavailable, the MTM is free to abort the transaction after a sufficient 
time out period. 

A simulator written in C++ was developed to evaluate the feasibility and 
performance of the proposed transaction-processing model. The simulation results 
showed that the performance of the Full Delegation mode of operation is better than 
the Partial Delegation mode. This comes about as a result of fewer communications 
between the mobile unit and the multidatabase system. The performance of the system 
was evaluated by varying the number of simultaneous local transactions executing at 
each node and by varying the ratio of global to mobile transactions present in the 
system. 





[Throughput plots; legend: Global (FDM), Mobile (FDM), Global (PDM), Mobile (PDM)] 

Fig. 3. Throughput with 20% Mobile Transactions 

Fig. 4. Throughput with 50% Mobile Transactions 



5.2 Future Directions 

The proposed transaction processing system can be extended in a number of ways: 

• Our simulation results showed the validity of the proposed transaction-processing 
model. However, it would be interesting to study the model in a real mobile 
computing environment. A potential approach would be to implement the MDAS 
as part of the Mobile Computing Environment and simulation test bed (MCE) 
proposed in [16]. 

Fig. 5. Throughput with 80% Mobile Transactions 

• The simulations were run with a fixed number of nodes in the system. The effect 
of varying the number of nodes on the system should be examined. 

• The test results showed the performance of the system when either of the two 
operation modes was employed. The effect of mixed operation modes should be 
examined. It would be interesting to study the effect of mobility at the data sources 
level, as well. 

As a final note, the development of effective E-commerce technologies is in its 
formative stage. As E-commerce moves from a largely business-to-business model to 
include a proliferation of retail channels, the demand for mobile data access 
will grow. The problems of effective mobile data access must be resolved to 
allow the effective development of electronic markets. 



Abstract. The paper aims at extending the categorical approach to 
Petri net based models with time constraints. We define a category of 
net processes with dense time, and use the general framework of open 
maps to obtain a notion of bisimulation. We show this to be equivalent 
to the standard notion of timed bisimulation. Next, decidability of timed 
bisimulation is shown in the setting of finite net processes. Further, the 
result on decidability is applied to time safe Petri nets, using a timed 
version of the McMillan-unfolding. 



1 Introduction 

Category theory has been used to structure the seemingly confusing world of 
models for concurrency - see [29] for a survey. The general idea is to formalize 
that one model is more expressive than another in terms of an ‘embedding’, most 
often taking the form of a coreflection, i.e. an adjunction in which the unit is an 
isomorphism. The models are equipped with behaviour preserving morphisms, 
to be thought of as kinds of simulations. 

An important ingredient of every theory of concurrency is a notion of equiva- 
lence between processes. Bisimulation [10] is the best known behavioural equiva- 
lence. In an attempt to understand the relationships and differences between the 
extensive amount of research within the field of bisimulation equivalences, Joyal, 
Nielsen, and Winskel [12] proposed an abstract category-theoretic definition of 
bisimulation. They identify spans of morphisms satisfying certain ‘path lifting’ 
properties, so-called open maps, as an abstract definition of bisimilarity. Further, 
in [22] open maps have been used to define different notions of bisimulation for 
a range of models, but none of these have modelled real-time. 

Recently, the demand for correctness analysis of real-time systems, i.e. systems 
whose descriptions involve a quantitative notion of time, has increased rapidly. 
Timed extensions of interleaving models have been investigated thoroughly in the 
last ten years. Various recipes on how to incorporate time in transition systems 
- the most prominent interleaving model - are, for example, described in [2,21]. 

* This work is partially supported by the Russian Fund of Basic Research (Grant N 
00-01-00898). 

Timed bisimulation was shown decidable for finite timed transition systems by 
Cerans in [7], and since then more efficient algorithms have been discovered in 
[14,27]. 

On the other hand, the incorporation of quantitative information into 
non-interleaving models has received scant attention: a few extensions are known of 
pomsets [6], asynchronous transition systems [1], net processes [3,4,15,26], and 
event structures [13,20]. In this respect, Petri net models are the only notable 
exception: various timed generalizations of the models are known in the literature 
(see [16,24] among others). 

In this paper, we present a model of timed net processes, which are a timed
extension of occurrence nets (e.g., [8]) obtained by associating their events
with two timing constraints that indicate earliest and latest occurrence times,
both with regard to a global clock. Events once ready - i.e., all their causal
predecessors have occurred and their timing constraints are respected - are
forced to occur, provided they are not disabled by other events. A timed net
process progresses through a sequence of states as events occur at certain time
moments. An event occurrence takes no time. The model appeared to us as the
simplest and most natural approach for our purpose.

The main contribution of the paper is to show the applicability of the general 
categorical framework of open maps to true concurrent models with dense time. 
We first define a category of timed net processes, where the morphisms are 
to be thought of as simulations, and an accompanying path (sub)category of 
timed words, which, following [12], provides us with notions of open maps and 
a bisimulation. Next, we show within the framework of open maps that timed 
bisimulation is decidable for finite timed net processes. Further, the result on 
decidability is applied to safe Petri nets with discrete time [16,4], using a timed 
version of the McMillan-unfolding [19]. 

There have been several motivations for this work. One has been given by the
paper [8], where a theory of branching processes of Petri nets has been
proposed. Further, the approach has been successfully extended to timed net
models (see [3,15,26]). A further source has been the papers [17,18,23,25,28],
which have extensively studied categorical characterizations of Petri net based
models. Furthermore, the paper [5] first establishes a precise connection
between morphisms of Petri nets which consider only their static structures and
morphisms on their dynamic behaviours, and then applies the results to a
discrete timing of Petri nets. Finally, another motivation has been given by the
paper [11], which provides an alternative proof of decidability of bisimulation
for an interleaving model with dense time (finite timed transition systems) in
terms of open maps, and illustrates the use of open maps in presenting timed
bisimilarity.

The rest of the paper is organized as follows. The basic notions concerning 
timed net processes are introduced in the next section. A category of timed net 
processes and an accompanying path (sub)category of timed words, are defined 
in Sect. 3. Section 4 introduces the concept of open morphism and shows its 
decidability in the framework of finite timed net processes. In Sect. 5, based
on spans of open maps, the resulting notion of bisimulation is studied and
established to coincide with the standard notion of timed bisimulation. Further,
decidability of timed bisimulation in the setting of finite processes is shown.
In Sect. 6 the result on decidability of timed bisimulation is applied to time
safe Petri nets.

2 Timed Net Processes 

In this section, we briefly define some terminology concerning timed net
processes. A 'timed net process' is a timed extension of an occurrence net
(e.g. [8]) obtained by associating its events with two timing constraints that
indicate earliest and latest occurrence times, both with regard to a global
clock. Events once ready - i.e., all their causal predecessors have occurred and
their timing constraints are respected - are forced to occur, provided they are
not disabled by other events. A timed net process progresses through a sequence
of states as events occur at certain time moments. An event occurrence takes no
time.

We start with the well-known concept of a net. A net is a triple $(B, E, G)$,
where $B$ is a set of conditions; $E$ is a set of events ($B \cap E = \emptyset$);
$G \subseteq (B \times E) \cup (E \times B)$ is the flow relation
($E \subseteq dom(G) \cap cod(G)$).

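For $x \in B \cup E$, ${}^\bullet x = \{y \in B \cup E \mid (y, x) \in G\}$ and
$x^\bullet = \{y \in B \cup E \mid (x, y) \in G\}$ denote the preset and postset
of $x$, respectively. Note that the definition above excludes events with
${}^\bullet e = \emptyset$ or $e^\bullet = \emptyset$. A net $(B, E, G)$ is
acyclic, if $G^+$ ($G^+$ is the transitive closure of $G$) is acyclic;
$(B, E, G)$ is finitary, if for all $x \in B \cup E$ the set
$\{y \in B \cup E \mid y \, G^+ \, x\}$ is finite. A net $(B', E', G')$ is a
subnet of $(B, E, G)$, if $B' \subseteq B$, $E' \subseteq E$,
$G' \subseteq G \cap (B' \times E' \cup E' \times B')$. Elements
$x, y \in B \cup E$ are in conflict (written $x \mathrel{\#} y$) iff there exist
distinct events $e_1, e_2 \in E$ such that
${}^\bullet e_1 \cap {}^\bullet e_2 \neq \emptyset$ and
$(e_1, x), (e_2, y) \in G^*$ ($G^*$ is the reflexive and transitive closure of
$G$).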

A net process is an acyclic finitary net $N = (B, E, G)$ such that
$|{}^\bullet b| \le 1$ for all $b \in B$ and $\neg(e \mathrel{\#} e)$ for all
$e \in E$, i.e. no event is in conflict with itself.

Let ${}^\bullet N = \{b \in B \mid {}^\bullet b = \emptyset\}$ (the set of input
conditions of $N$) and $N^\bullet = \{b \in B \mid b^\bullet = \emptyset\}$ (the
set of output conditions of $N$).

A computation of a net process $N = (B, E, G)$ is a subnet
$\pi = (B_\pi, E_\pi, G_\pi)$ of $N$ such that ${}^\bullet \pi = {}^\bullet N$
and $|b^\bullet| \le 1$ for all $b \in B_\pi$. The initial computation $\pi_N$
is the one with $E_{\pi_N} = \emptyset$. Let $Comp(N)$ denote the set of
computations of $N$. For $e \in E$ and $\pi \in Comp(N)$, $e$ is enabled after
$\pi$ if ${}^\bullet e \subseteq \pi^\bullet$, otherwise it is disabled. Let
$En(\pi)$ be the set of events enabled after $\pi$. If some $e \in E$ is enabled
after $\pi$, we can extend $\pi$ to a computation $\pi'$ by adding the event $e$
and its postset $e^\bullet$. We write $\pi \xrightarrow{e} \pi'$ in this case.
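The notions of preset, postset and enabling can be made concrete with a small sketch. It is only an illustration under stated assumptions: the flow relation is stored as a set of pairs, a computation is represented by its sets of conditions and events, and the example net (`FLOW`, `EVENTS`) and all function names are hypothetical rather than taken from the paper.

```python
def preset(x, flow):
    """Preset of x: every y with (y, x) in the flow relation."""
    return {y for (y, z) in flow if z == x}


def postset(x, flow):
    """Postset of x: every z with (x, z) in the flow relation."""
    return {z for (y, z) in flow if y == x}


def output_conditions(conds, events, flow):
    """Conditions of a computation that none of its events has consumed."""
    return {b for b in conds if not (postset(b, flow) & events)}


def enabled_after(conds, events, all_events, flow):
    """Events whose whole preset lies in the output conditions of the computation."""
    out = output_conditions(conds, events, flow)
    return {e for e in all_events - events if preset(e, flow) <= out}


def extend(conds, events, e, flow):
    """Extend a computation by the event e and its postset."""
    return conds | postset(e, flow), events | {e}


# Hypothetical net process: e1 and e3 are in conflict (they share b1).
FLOW = {("b1", "e1"), ("e1", "b2"), ("b2", "e2"), ("e2", "b3"),
        ("b1", "e3"), ("e3", "b4")}
EVENTS = {"e1", "e2", "e3"}

pi0 = ({"b1"}, set())             # initial computation: input conditions only
pi1 = extend(*pi0, "e1", FLOW)    # computation after e1 has occurred
```

In this example both `e1` and `e3` are enabled after the initial computation, while only `e2` is enabled once `e1` has been added.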

Let $Act = \{a, a_1, a_2, \ldots\}$ be a set of actions. A net process (labelled
over $Act$) is a tuple $N = (B, E, G, L)$, where $(B, E, G)$ is a net process
and $L : E \to Act$ is a labelling function. We define
$Act_N = \{a \in Act \mid \exists\, e \in E \bullet L(e) = a\}$.

Before introducing the basic concepts of timed net processes we need to
consider some auxiliary notations. Let $\mathbb{N}$ be the set of natural
numbers and $\mathbb{R}_0^+$ the set of nonnegative real numbers. We use $d$,
possibly subscripted and/or primed, to range over $\mathbb{R}_0^+$. We now come
to the definition of timed net processes labelled over $Act$.

Definition 1 A timed net process (labelled over $Act$) is a tuple
$TN = (N = (B, E, G, L), Eot, Lot)$, where $N$ is a net process (labelled over
$Act$); $Eot, Lot : E \to \mathbb{N}$ are functions of the earliest and latest
occurrence times of events, satisfying $Eot(e) \le Lot(e)$ for all $e \in E$.

Figure 1 shows a simple example of a timed net process, where the pair of
numbers next to an event gives its earliest and latest occurrence times.




Let $\Gamma(TN) = [E \to \mathbb{R}_0^+]$ be the set of time assignments for
events from $E$. Given $\tau \in \Gamma(TN)$, we let
$\Delta(\tau) = \sup\{\tau(e) \mid e \in E\}$.

A state of $TN$ is a pair $(\pi, \tau)$, where $\pi$ is a computation and
$\tau \in \Gamma(TN)$. The initial state of $TN$ is the pair
$(\pi_N, \tau_{TN})$, where $\tau_{TN}(e) = 0$ for all $e \in E$.

The states of $TN$ change if an event occurs at some global time moment.

In a state $(\pi, \tau)$, an event $e$ may occur at a time moment
$d \in \mathbb{R}_0^+$, if $e \in En(\pi)$, $\Delta(\tau) \le d$,
$Eot(e) \le d$ and $d \le Lot(e')$ for all $e' \in En(\pi)$. In this case, the
state $(\pi', \tau')$ is obtained by the occurrence of the event $e$ at the time
moment $d$ (denoted $(\pi, \tau) \xrightarrow{(e,d)} (\pi', \tau')$), if
$\pi \xrightarrow{e} \pi'$,
$\tau'|_{E \setminus \{e\}} = \tau|_{E \setminus \{e\}}$, and $\tau'(e) = d$. A
state $(\pi, \tau)$ is reachable if either $(\pi, \tau) = (\pi_N, \tau_{TN})$ or
there exists a reachable state $(\pi', \tau')$ such that
$(\pi', \tau') \xrightarrow{(e,d)} (\pi, \tau)$ for some $e \in E$ and
$d \in \mathbb{R}_0^+$. We use $RS(TN)$ to denote the set of all reachable
states of $TN$.
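The occurrence rule can be sketched in a few lines. This is a minimal illustration, assuming that the set of enabled events is supplied by the caller and that time assignments and timing constraints are plain dictionaries; the names `may_occur`, `occur`, `EOT`, `LOT` and the numeric values are hypothetical and are not claimed to match Fig. 1.

```python
def delta(tau):
    """Largest time recorded so far in the time assignment tau."""
    return max(tau.values(), default=0)


def may_occur(e, d, enabled, tau, eot, lot):
    """Check the occurrence rule for event e at global time d.

    e must be enabled, d must not lie before any recorded time, d must
    respect the earliest time of e, and d must not exceed the latest
    time of any enabled event.
    """
    return (e in enabled
            and delta(tau) <= d
            and eot[e] <= d
            and all(d <= lot[x] for x in enabled))


def occur(e, d, tau):
    """Return the new time assignment after e occurs at time d."""
    new_tau = dict(tau)
    new_tau[e] = d
    return new_tau


# Hypothetical timing constraints for two causally ordered events.
EOT = {"e1": 0, "e2": 2}
LOT = {"e1": 1, "e2": 2}
tau0 = {"e1": 0, "e2": 0}          # initial time assignment
tau1 = occur("e1", 0.5, tau0)      # e1 occurs at time 0.5
```

Here `e1` may occur at time 0.5 but not at 1.5, because 1.5 exceeds its latest occurrence time; afterwards `e2` may occur only at time 2.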

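A timed word of an alphabet $Act$ over $\mathbb{R}_0^+$ is a finite sequence of
pairs: $w = (a_1, d_1)(a_2, d_2) \ldots (a_n, d_n)$, where for all
$1 \le i \le n$, $a_i \in Act$, $d_i \in \mathbb{R}_0^+$, and furthermore
$d_i \le d_{i+1}$. We shall write
$(\pi, \tau) \xrightarrow{(a,d)} (\pi', \tau')$ if
$(\pi, \tau) \xrightarrow{(e,d)} (\pi', \tau')$ and $L(e) = a$. A run $r$ of a
timed word $w = (a_1, d_1) \ldots (a_n, d_n)$ is a finite sequence of the form:

$$r = (\pi_N, \tau_{TN}) \xrightarrow{(a_1,d_1)} (\pi_1, \tau_1) \ldots (\pi_{n-1}, \tau_{n-1}) \xrightarrow{(a_n,d_n)} (\pi_n, \tau_n).$$

In this case, we say that $(\pi_n, \tau_n)$ is reachable by a timed word $w$.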

As an illustration, we construct the set of the timed words corresponding to
the runs of the timed net process $TN_1$ (see Fig. 1):
$\{(a_1, d_1),\ (a_1, d_1)(a_2, d_2),\ (a_1, d_1)(a_3, d_3) \mid 0 \le d_1 \le 1,\ d_2 = 2,\ 0 \le d_3 \le 2,\ d_1 \le d_3\}$.

3 A Category of Timed Net Processes

In this section, we define a category of timed net processes and an accompanying 
path (sub)category of timed words. 

The morphisms of our model category will be simulation morphisms follow- 
ing the approach of [12]. This leads to the following definition of a morphism, 
consisting of a relation between conditions of the simulated system and simu- 
lating conditions of the other, and a function, mapping events of the simulated 
system to simulating events of the other, satisfying some further requirements. 

Definition 2 A morphism between timed net processes
$TN = (N = (B, E, G, L), Eot, Lot)$ and
$TN' = (N' = (B', E', G', L'), Eot', Lot')$, $(\lambda, \mu) : TN \to TN'$,
consists of a relation $\lambda \subseteq B \times B'$ and a partial function
$\mu : E \to E'$ such that:

- ${}^\bullet \pi_{N'} = \lambda\, {}^\bullet \pi_N$;

- $\lambda\, {}^\bullet e = {}^\bullet(\mu(e))$ and
  $\lambda\, e^\bullet = (\mu(e))^\bullet$ for all $e \in E$;

- $\mu(e) = \mu(e') \Rightarrow e = e'$ for all $e, e' \in E_\pi$ and
  $\pi \in Comp(N)$;

- $L' \circ \mu = L$;

- $Eot'(\mu(e)) \le Eot(e)$ and $Lot(e) \le Lot'(\mu(e))$ for all $e \in E$.

As an illustration, consider a morphism from the timed net process $TN_2$ in
Fig. 2 to the timed net process $TN_1$ in Fig. 1 mapping conditions $b'_i$ to
$b_i$ ($1 \le i \le 4$) and events $e'_j$ to $e_j$ ($1 \le j \le 3$). It is easy
to check that the constraints in Definition 2 are satisfied.

From now on, we use $(\lambda, \mu) \cdot \pi$ to denote the application of a
morphism $(\lambda, \mu)$ to a computation $\pi$ of some timed net process.




Fig. 2. 



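We now state the simulation property of the morphisms just defined.

Theorem 1 Given a morphism $(\lambda, \mu) : TN \to TN'$ and a timed word
$(a_1, d_1) \ldots (a_n, d_n)$. If

$$(\pi_N = \pi_0, \tau_{TN} = \tau_0) \xrightarrow{(a_1,d_1)} (\pi_1, \tau_1) \ldots (\pi_{n-1}, \tau_{n-1}) \xrightarrow{(a_n,d_n)} (\pi_n, \tau_n)$$

is a run in $TN$, then

$$(\pi_{N'} = (\lambda,\mu) \cdot \pi_0, \tau_{TN'} = \tau'_0) \xrightarrow{(a_1,d_1)} ((\lambda,\mu) \cdot \pi_1, \tau'_1) \ldots ((\lambda,\mu) \cdot \pi_{n-1}, \tau'_{n-1}) \xrightarrow{(a_n,d_n)} ((\lambda,\mu) \cdot \pi_n, \tau'_n)$$

is a run in $TN'$.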



Proof Sketch. It is straightforward by induction on $n$, using the definitions
of a computation, a morphism and the relation
$(\pi, \tau) \xrightarrow{(a,d)} (\pi', \tau')$. □

Thus, in the formal sense of Theorem 1 we have shown that the morphisms from
Definition 2 do represent a notion of simulation. Now, define a category of
timed net processes as follows.

Definition 3 Timed net processes (labelled over $Act$) with morphisms between
them form a category of timed net processes $\mathcal{CTN}_{Act}$, in which the
composition of two morphisms $(\lambda_1, \mu_1) : TN_0 \to TN_1$ and
$(\lambda_2, \mu_2) : TN_1 \to TN_2$ is
$(\lambda_2 \circ \lambda_1, \mu_2 \circ \mu_1) : TN_0 \to TN_2$, and the
identity morphism has the form $(1_B, 1_E)$, where $1_B$ and $1_E$ are the
identities on the condition-sets and event-sets, respectively.



Proposition 1 $\mathcal{CTN}_{Act}$ is a category.

Following the standards of timed net processes and the paper [12], we would
like to choose timed words with word extension so as to form a subcategory of
$\mathcal{CTN}_{Act}$. For each timed word $w$, we shall construct a timed net
process as follows.

Definition 4 Given a timed word $w = (a_1, d_1)(a_2, d_2) \ldots (a_n, d_n)$, we
define a timed net process $TN_w = (N_w = (B_w, E_w, G_w, L_w), Eot_w, Lot_w)$
as follows: $B_w = \{1', 2', \ldots, n', (n+1)'\}$; $E_w = \{1, 2, \ldots, n\}$;
$G_w = \{(i', i), (i, (i+1)') \mid 1 \le i \le n\}$; $L_w(i) = a_i$
($i = 1, 2, \ldots, n$); $Eot_w(i) = Lot_w(i) = d_i$ ($i = 1, 2, \ldots, n$).
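The construction of Definition 4 is mechanical and can be sketched as follows. This is an illustrative reading of the definition, assuming that conditions are encoded as strings such as `"1'"` and events as integers; the function names and the sample word are hypothetical.

```python
def is_timed_word(w):
    """A timed word has nonnegative, non-decreasing time stamps."""
    stamps = [d for _, d in w]
    return (all(d >= 0 for d in stamps)
            and all(x <= y for x, y in zip(stamps, stamps[1:])))


def word_to_process(w):
    """Build the chain-shaped timed net process of a timed word.

    Returns (conditions, events, flow, label, eot, lot), where event i
    carries the action a_i and the point interval [d_i, d_i].
    """
    if not is_timed_word(w):
        raise ValueError("time stamps must be nonnegative and non-decreasing")
    n = len(w)
    conditions = [f"{i}'" for i in range(1, n + 2)]   # 1', ..., (n+1)'
    events = list(range(1, n + 1))                    # 1, ..., n
    flow = set()
    for i in events:
        flow.add((f"{i}'", i))          # condition i' enables event i
        flow.add((i, f"{i + 1}'"))      # event i produces condition (i+1)'
    label = {i: w[i - 1][0] for i in events}
    eot = {i: w[i - 1][1] for i in events}
    lot = dict(eot)                     # earliest and latest times coincide
    return conditions, events, flow, label, eot, lot


example = word_to_process([("a", 0.5), ("b", 2.0)])
```

Because the earliest and latest occurrence times of each event coincide, the only run of the resulting process is the timed word itself.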

The purpose of the construction is to represent the category of timed words
with extension inside $\mathcal{CTN}_{Act}$, and to identify runs of $w$ in
$TN$ with morphisms from $TN_w$ to $TN$, as expressed formally in the following
two results.

Proposition 2 The construction of the timed net process $TN_w$ from a timed
word $w$ extends to a full and faithful functor from the category of timed words
(as objects) and word extensions (as morphisms) into $\mathcal{CTN}_{Act}$.



Theorem 2 Consider a timed net process $TN$ and a timed word
$w = (a_1, d_1) \ldots (a_n, d_n)$. For all the runs of $w$ in $TN$,

$$(\pi_N = \pi_0, \tau_{TN} = \tau_0) \xrightarrow{(a_1,d_1)} (\pi_1, \tau_1) \ldots (\pi_{n-1}, \tau_{n-1}) \xrightarrow{(a_n,d_n)} (\pi_n, \tau_n)$$

such that $\pi_{i-1} \xrightarrow{e_i} \pi_i$ ($0 < i \le n$), we can associate
a morphism $(\lambda, \mu) : TN_w \to TN$ such that $\mu(i) = e_i$.
Furthermore, this association is a bijection between the runs of $w$ in $TN$
and morphisms $(\lambda, \mu) : TN_w \to TN$.

Proof Sketch. It directly follows from the definitions of a run of w and a 
morphism. □ 

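4 TW-Open Morphisms

Given our categories of timed net processes and timed words, we can apply the
general framework from [12], defining a notion of TW-open map.

Definition 5 A morphism $(\lambda, \mu) : TN \to TN'$ in $\mathcal{CTN}_{Act}$
is TW-open iff for all timed words $w$ and $w'$, and morphisms such that the
following diagram commutes:

$$\begin{array}{ccc}
TN_w & \xrightarrow{(\lambda',\mu')} & TN \\
\downarrow & & \downarrow {\scriptstyle (\lambda,\mu)} \\
TN_{w'} & \xrightarrow{(\lambda'',\mu'')} & TN'
\end{array}$$

there exists a morphism $(\bar\lambda, \bar\mu) : TN_{w'} \to TN$ such that in
the diagram

$$\begin{array}{ccc}
TN_w & \xrightarrow{(\lambda',\mu')} & TN \\
\downarrow & \nearrow {\scriptstyle (\bar\lambda,\bar\mu)} & \downarrow {\scriptstyle (\lambda,\mu)} \\
TN_{w'} & \xrightarrow{(\lambda'',\mu'')} & TN'
\end{array}$$

the two triangles commute.

Our next aim is to characterize TW-openness of morphisms.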

Theorem 3 Let $(\pi_1, \tau_1)$ and $((\lambda, \mu) \cdot \pi_1, \tau'_1)$ be
reachable by $w$ in $TN$ and $TN'$, respectively. A morphism
$(\lambda, \mu) : TN \to TN'$ is TW-open iff whenever
$((\lambda, \mu) \cdot \pi_1, \tau'_1) \xrightarrow{(a,d)} (\pi'_2, \tau'_2)$ in
$TN'$ then $(\pi_1, \tau_1) \xrightarrow{(a,d)} (\pi_2, \tau_2)$ in $TN$ and
$(\lambda, \mu) \cdot \pi_2 = \pi'_2$.

Proof Sketch. The proof follows the lines of other standard proofs
characterizing the openness of a morphism (see e.g., [11]), using the definition
of a morphism and Theorem 2. □

We do not require the category $\mathcal{CTN}_{Act}$ to have pullbacks. The
following weaker result suffices.

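Theorem 4 Given two TW-open morphisms $(\lambda_1, \mu_1) : TN_1 \to TN$ and
$(\lambda_2, \mu_2) : TN_2 \to TN$, there exist a timed net process $TN_X$ and
TW-open morphisms $(\lambda'_1, \mu'_1) : TN_X \to TN_1$,
$(\lambda'_2, \mu'_2) : TN_X \to TN_2$ such that the diagram commutes:

$$\begin{array}{ccc}
TN_X & \xrightarrow{(\lambda'_2,\mu'_2)} & TN_2 \\
{\scriptstyle (\lambda'_1,\mu'_1)} \downarrow & & \downarrow {\scriptstyle (\lambda_2,\mu_2)} \\
TN_1 & \xrightarrow{(\lambda_1,\mu_1)} & TN
\end{array}$$

Proof Sketch. See Appendix. □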

We next consider the decidability question for openness of a morphism in the
setting of finite timed net processes, i.e. processes with finite sets $B$ and
$E$. This subclass of timed net processes is denoted by $TN_f$. As for many
existing results for timed models, including results concerning verification of
real-time systems, our decision procedure relies heavily on the idea behind
regions [2], which essentially provides a finite description of the state-space
of timed net processes.

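Given a timed net process $TN$ and $\tau, \tau' \in \Gamma(TN)$, we let
$\tau \simeq \tau'$ iff (i) for each $e \in E$ it holds:
$\lfloor \tau(e) \rfloor = \lfloor \tau'(e) \rfloor$, and (ii) for each
$e, e' \in E$ it holds:
$\{\tau(e)\} \le \{\tau(e')\} \Leftrightarrow \{\tau'(e)\} \le \{\tau'(e')\}$,
and $\{\tau(e)\} = 0 \Leftrightarrow \{\tau'(e)\} = 0$. Here, for
$d \in \mathbb{R}_0^+$, $\{d\}$ and $\lfloor d \rfloor$ denote its fractional
and integer parts, respectively. For $\tau \in \Gamma(TN)$, let $[\tau]$ denote
the region to which it belongs.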
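The region equivalence can be checked directly from its two conditions. The sketch below is an illustration only; it assumes that time assignments are dictionaries over the same event set and uses exact rationals (`fractions.Fraction`) in the example so that fractional parts are compared without rounding error. The function names and sample values are hypothetical.

```python
import math
from fractions import Fraction


def frac(d):
    """Fractional part of a nonnegative number."""
    return d - math.floor(d)


def same_region(tau1, tau2):
    """Decide whether two time assignments lie in the same region."""
    if tau1.keys() != tau2.keys():
        raise ValueError("time assignments must range over the same events")
    events = list(tau1)
    for e in events:
        # (i) equal integer parts
        if math.floor(tau1[e]) != math.floor(tau2[e]):
            return False
        # (ii, second half) fractional part is zero in both or in neither
        if (frac(tau1[e]) == 0) != (frac(tau2[e]) == 0):
            return False
    for e in events:
        for f in events:
            # (ii, first half) same ordering of fractional parts
            if ((frac(tau1[e]) <= frac(tau1[f]))
                    != (frac(tau2[e]) <= frac(tau2[f]))):
                return False
    return True


t = {"e1": Fraction(1, 4), "e2": Fraction(3, 2)}
u = {"e1": Fraction(3, 10), "e2": Fraction(19, 10)}   # same region as t
v = {"e1": Fraction(7, 10), "e2": Fraction(3, 2)}     # fractional order differs
z = {"e1": Fraction(0), "e2": Fraction(3, 2)}         # zero fractional part
```

Since only integer parts and the relative order of fractional parts matter, the number of distinct regions for a finite event set with bounded time constants is finite, which is what the decision procedure exploits.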

An extended state of a timed net process $TN$ is defined as a pair
$(\pi, [\tau])$, where $(\pi, \tau) \in RS(TN)$. We consider
$(\pi_N, [\tau_{TN}])$ as the initial extended state of $TN$. For extended
states $(\pi, [\tau])$, $(\pi', [\tau'])$, we shall write
$(\pi, [\tau]) \xrightarrow{(a,d)} (\pi', [\tau'])$, if
$(\pi, \tau) \xrightarrow{(a,d)} (\pi', \tau')$. An extended state
$(\pi, [\tau])$ is called reachable by a timed word $w$, if $(\pi, \tau)$ is
reachable by the timed word $w$.

We can now give a characterization of TW-open maps in terms of extended states.

Theorem 5 Let $TN_1, TN_2 \in TN_f$ and $(\pi_1, [\tau_1])$,
$((\lambda, \mu) \cdot \pi_1, [\tau'_1])$ be extended states reachable by $w$ in
$TN_1$ and $TN_2$, respectively. A morphism $(\lambda, \mu) : TN_1 \to TN_2$ is
TW-open iff whenever
$((\lambda, \mu) \cdot \pi_1, [\tau'_1]) \xrightarrow{(a,d)} (\pi'_2, [\tau'_2])$
in $TN_2$, then $(\pi_1, [\tau_1]) \xrightarrow{(a,d)} (\pi_2, [\tau_2])$ in
$TN_1$ and $(\lambda, \mu) \cdot \pi_2 = \pi'_2$.

Proof Sketch. It follows from Theorem 3 and the definition of a region. □

Corollary 1 Openness of a morphism is decidable between $TN, TN' \in TN_f$.



5 Timed Bisimulation 



In this section, we first introduce a notion of TW-bisimulation, using the
concept of TW-open map. Then the standard notion of timed bisimulation is
defined in terms of states of timed net processes. Further, the coincidence of
the bisimilarity notions is shown. Finally, decidability of timed bisimulation
is demonstrated for finite timed net processes.

As was reported in [12], the TW-open map approach provides a general concept of
bisimilarity for any categorical model of computation. The definition is given
in terms of spans of TW-open maps.



Definition 6 Timed net processes $TN_1$ and $TN_2$ are TW-bisimilar iff there
exists a span
$TN_1 \xleftarrow{(\lambda_1,\mu_1)} TN \xrightarrow{(\lambda_2,\mu_2)} TN_2$
with vertex $TN$ of TW-open morphisms.

Following the approach of [12], it is easy to show that TW-bisimulation is
exactly the equivalence generated by TW-open maps, using Theorem 4.

Further, the notion of timed bisimulation [7] is defined in terms of states of
timed net processes as follows.



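Definition 7 Two timed net processes $TN_1$ and $TN_2$ are timed bisimilar iff
there exists a relation $\mathcal{B} \subseteq RS(TN_1) \times RS(TN_2)$,
satisfying the following conditions:
$((\pi_{N_1}, \tau_{TN_1}), (\pi_{N_2}, \tau_{TN_2})) \in \mathcal{B}$ and for
all $((\pi_1, \tau_1), (\pi_2, \tau_2)) \in \mathcal{B}$ it holds:

(a) if $(\pi_1, \tau_1) \xrightarrow{(a,d)} (\pi'_1, \tau'_1)$ in $TN_1$, then
$(\pi_2, \tau_2) \xrightarrow{(a,d)} (\pi'_2, \tau'_2)$ in $TN_2$ and
$((\pi'_1, \tau'_1), (\pi'_2, \tau'_2)) \in \mathcal{B}$ for some
$(\pi'_2, \tau'_2) \in RS(TN_2)$;

(b) if $(\pi_2, \tau_2) \xrightarrow{(a,d)} (\pi'_2, \tau'_2)$ in $TN_2$, then
$(\pi_1, \tau_1) \xrightarrow{(a,d)} (\pi'_1, \tau'_1)$ in $TN_1$ and
$((\pi'_1, \tau'_1), (\pi'_2, \tau'_2)) \in \mathcal{B}$ for some
$(\pi'_1, \tau'_1) \in RS(TN_1)$.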


Finally, the coincidence of the bisimilarity notions is established. 



Theorem 6 Timed net processes $TN_1$ and $TN_2$ are TW-bisimilar iff they are
timed bisimilar.

Proof Sketch. See Appendix. □

Fig. 3.



The timed net processes $TN_1$ and $TN_2$, shown in Fig. 1 and 2 respectively,
are not bisimilar. Next, consider the timed net processes in Fig. 3. It is easy
to see that there are morphisms from $TN_5$ to $TN_4$ and to $TN_3$, and these
morphisms are TW-open. Hence we have a span of TW-open maps between $TN_4$ and
$TN_3$. Bisimilarity between $TN_4$ and $TN_3$ follows from Theorem 6.



Theorem 7 Let $TN_1, TN_2 \in TN_f$. If there exists a span of TW-open maps
$TN_1 \leftarrow TN \rightarrow TN_2$ then there exists $TN' \in TN_f$ of size
bounded by the size of $TN_1$ and $TN_2$ and with TW-open morphisms
$TN_1 \leftarrow TN' \rightarrow TN_2$.

Proof Sketch. Since $TN_1 \leftarrow TN \rightarrow TN_2$ is a span of TW-open
maps, $TN_1$ and $TN_2$ are TW-bisimilar and hence, by Theorem 6, timed
bisimilar. This means that there exists a timed bisimulation $\mathcal{B}$
between states of $TN_1$ and $TN_2$. Using $\mathcal{B}$ we construct $TN'$ as
in the converse part of the proof of Theorem 6. From the construction it follows
that $TN' \in TN_f$. The number of extended states of $TN'$ is bounded by a term
of the form $|E|! \cdot \ldots \cdot (c + \ldots)$, where
$|E| = |E_1| * |E_2|$ ($|E_i|$ is the number of events in $TN_i$ ($i = 1, 2$)),
$|B| = |B_1| * |B_2|$ ($|B_i|$ is the number of conditions in $TN_i$
($i = 1, 2$)), and $c$ is the greatest constant appearing in the time
constraints in $TN_1$ and $TN_2$. □

Corollary 2 Timed bisimulation is decidable for $TN, TN' \in TN_f$.



6 Time Petri Nets 

Time Petri nets were introduced in [16]. Following the reasoning of [4], we
consider time Petri nets with discrete time.

We start with the well-known concept of a Petri net. A Petri net (labelled over
$Act$) is a tuple $\mathcal{N} = (P, T, F, m_0, l)$, where $(P, T, F)$ is a net
(with a finite set $P$ of conditions called places, a finite set $T$ of events
called transitions ($P \cap T = \emptyset$), and the flow relation
$F \subseteq (P \times T) \cup (T \times P)$); $m_0 \subseteq P$ is the initial
marking; $l : T \to Act$ is a labelling function. For $t \in T$, we let
${}^\bullet t = \{p \in P \mid (p, t) \in F\}$ and
$t^\bullet = \{p \in P \mid (t, p) \in F\}$. To simplify the presentation, we
assume that ${}^\bullet t \cap t^\bullet = \emptyset$ for every transition $t$.
A marking $m$ of $\mathcal{N}$ is any subset of $P$. A transition $t$ is enabled
in a marking $m$ if ${}^\bullet t \subseteq m$ (all its input places have tokens
in $m$), otherwise it is disabled. Let $En(m)$ be the set of transitions enabled
in $m$. If $t \in En(m)$ then it may fire, and its firing leads to a new marking
$m'$ (denoted $m \xrightarrow{t} m'$) defined by
$m' = (m \setminus {}^\bullet t) \cup t^\bullet$. A marking $m$ is reachable if
$m = m_0$ or there exists a reachable marking $m'$ such that
$m' \xrightarrow{t} m$. A Petri net is called safe if for every reachable
marking $m$ and for every $t \in En(m)$ it holds:
$t^\bullet \cap m = \emptyset$.
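The firing rule and the safety condition translate almost literally into code. The following sketch is illustrative and makes its own representation choices: markings are frozensets of place names, and the presets and postsets of transitions are given as two dictionaries (`PRE`, `POST`); all names are hypothetical.

```python
from collections import deque


def enabled(m, pre):
    """Transitions whose input places are all marked in m."""
    return {t for t, inputs in pre.items() if inputs <= m}


def fire(m, t, pre, post):
    """Fire t in marking m: remove the preset, then add the postset."""
    if not pre[t] <= m:
        raise ValueError(f"transition {t} is not enabled")
    return frozenset((m - pre[t]) | post[t])


def reachable_markings(m0, pre, post):
    """Breadth-first exploration of all markings reachable from m0."""
    start = frozenset(m0)
    seen = {start}
    queue = deque([start])
    while queue:
        m = queue.popleft()
        for t in enabled(m, pre):
            m2 = fire(m, t, pre, post)
            if m2 not in seen:
                seen.add(m2)
                queue.append(m2)
    return seen


def is_safe(m0, pre, post):
    """Safety as stated in the text: no enabled transition has a marked output place."""
    return all(not (post[t] & m)
               for m in reachable_markings(m0, pre, post)
               for t in enabled(m, pre))


# Hypothetical two-place cycle: t1 moves the token from p1 to p2, t2 moves it back.
PRE = {"t1": frozenset({"p1"}), "t2": frozenset({"p2"})}
POST = {"t1": frozenset({"p2"}), "t2": frozenset({"p1"})}
```

The exploration terminates because markings are subsets of the finite set of places, so at most $2^{|P|}$ markings can be visited.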

Let $\mathcal{N} = (P, T, F, m_0, l)$ be a (labelled) Petri net and
$N = (B, E, G, L)$ be a (labelled) net process. Then a mapping
$\varphi : (B \cup E) \to (P \cup T)$ is called a homomorphism iff
$\varphi(B) \subseteq P$, $\varphi(E) \subseteq T$ and for all $e \in E$ the
following hold: (i) the restriction of $\varphi$ to ${}^\bullet e$ is a
bijection between ${}^\bullet e$ and ${}^\bullet \varphi(e)$; (ii) the
restriction of $\varphi$ to $e^\bullet$ is a bijection between $e^\bullet$ and
$\varphi(e)^\bullet$; (iii) the restriction of $\varphi$ to ${}^\bullet N$ is a
bijection between ${}^\bullet N$ and $m_0$; (iv) $L(e) = l(\varphi(e))$. A pair
$(N, \varphi)$ is called a process of $\mathcal{N}$ iff $\varphi$ is a
homomorphism from $N$ to $\mathcal{N}$. For each Petri net $\mathcal{N}$, there
exists a unique (up to renaming of conditions and events) maximal process, where
'maximal' is related to the prefix ordering (cf. [8]). The McMillan-unfolding of
$\mathcal{N}$ (denoted $McM(\mathcal{N})$) was defined in [19] as a finite
prefix of the maximal process of $\mathcal{N}$ such that each reachable marking
of $\mathcal{N}$ occurs as an image of output conditions of some computation of
this prefix. It can be shown to be unique and finite.

A time Petri net is a Petri net whose transitions are labelled by their
earliest and latest firing times, which denote the minimal and maximal number of
time units, respectively, that may pass between the enabling and the occurrence
of a transition.

Definition 8 A time Petri net (labelled over $Act$) is a tuple
$\mathcal{TN} = (\mathcal{N} = (P, T, F, m_0, l), eft, lft)$, where
$\mathcal{N} = (P, T, F, m_0, l)$ is a safe Petri net (labelled over $Act$) and
$eft, lft : T \to \mathbb{N}$ are functions of the earliest and latest firing
times of transitions, satisfying $eft(t) \le lft(t)$ for all $t \in T$.



Let $\mathcal{V}(\mathcal{TN}) = [T \to \mathbb{N}]$ be the set of time
assignments for transitions from $T$. A state of $\mathcal{TN}$ is a pair
$(m, \nu)$, where $m$ is a marking and $\nu \in \mathcal{V}(\mathcal{TN})$. The
initial state of $\mathcal{TN}$ is the pair $(m_0, \nu_0)$, where
$\nu_0(t) = 0$ for all $t \in T$. In a state $(m, \nu)$, a transition
$t \in T$ may fire after a delay $\delta \in \mathbb{N}$ if $t \in En(m)$,
$eft(t) \le \delta$, and $\delta \le lft(t')$ for all $t' \in En(m)$. In this
case, the state $(m', \nu')$ is obtained by firing $t$ after a delay $\delta$
(written $(m, \nu) \xrightarrow{(t,\delta)} (m', \nu')$), if
$m \xrightarrow{t} m'$ and for all $t' \in T$ it holds: $\nu'(t') = 0$ if
$t' \in En(m') \setminus En(m)$, $\nu'(t') = \delta$ if $t' = t$, otherwise
$\nu'(t') = \nu(t')$. A state $(m, \nu)$ is reachable if
$(m, \nu) = (m_0, \nu_0)$ or there exists a reachable state $(m', \nu')$ such
that $(m', \nu') \xrightarrow{(t,\delta)} (m, \nu)$ for some $t \in T$ and
$\delta \in \mathbb{N}$. Let $RS(\mathcal{TN})$ denote the set of all reachable
states of $\mathcal{TN}$. A run $r$ in $\mathcal{TN}$ is a sequence of the form:

$$(m_0, \nu_0) \xrightarrow{(t_1,\delta_1)} (m_1, \nu_1) \ldots (m_{n-1}, \nu_{n-1}) \xrightarrow{(t_n,\delta_n)} (m_n, \nu_n) \ldots$$

To guarantee that in any run time increases beyond any bound, we need the
following progress condition: for every set of transitions
$\{t_1, t_2, \ldots, t_n\}$ such that
$\forall\, 1 \le i < n \bullet t_i^\bullet \cap {}^\bullet t_{i+1} \neq \emptyset$
and $t_n^\bullet \cap {}^\bullet t_1 \neq \emptyset$ it holds
$\sum_{1 \le i \le n} eft(t_i) > 0$. In the sequel, $\mathcal{TN}$ will always
denote a time Petri net satisfying the progress condition.

Let $\mathcal{TN} = (\mathcal{N} = (P, T, F, m_0, l), eft, lft)$ be a time
Petri net, and $(N, \varphi)$ be a process of $\mathcal{N}$. Then

- a mapping $\sigma : E_N \to \mathbb{N}$ is called a timing of $N$;

- if $B' \subseteq B_N$ and $t \in En(\varphi(B'))$, then the time of enabling
  for $t$ in $B'$ under $\sigma$ is given by:
  $TOE_\sigma(B', t) = \max(\{\sigma(e) \mid \{e\} = {}^\bullet b,\ b \in (B' \setminus {}^\bullet N),\ \varphi(b) \in {}^\bullet t\} \cup \{0\})$;

- a timing $\sigma$ is valid iff for all $e \in E_N$ the following holds:
  $eft(\varphi(e)) \le \sigma(e) - TOE_\sigma({}^\bullet e, \varphi(e)) \le lft(\varphi(e))$.
  Let $VT(N, \varphi)$ denote the set of valid timings of $(N, \varphi)$;

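- $(T(N), \varphi)$ is called a timed process of $\mathcal{TN}$ iff
  $T(N) = \bigcup \big( (N_\sigma, Eot_\sigma, Lot_\sigma) \mid \sigma \in VT(N, \varphi) \big)$,
  where

  • $N_\sigma = (B, E, G, L)$ with

    * $B_{N_\sigma} = \{b_\sigma \mid b \in B_N\} \cup \{\hat{b} \mid \hat{b} \notin B_N\}$;

    * $E_{N_\sigma} = \{e_\sigma \mid e \in E_N\} \cup \{\hat{e} \mid \hat{e} \notin E_N\}$;

    * $G_{N_\sigma} = \{(e_\sigma, b_\sigma), (b_\sigma, e_\sigma) \mid (e, b), (b, e) \in G_N\} \cup \{(\hat{b}, \hat{e})\} \cup \{(\hat{e}, b_\sigma) \mid b \in {}^\bullet N\}$;

    * $L_{N_\sigma}(e_\sigma) = L_N(e)$ for all
      $e_\sigma \in (E_{N_\sigma} \setminus \{\hat{e}\})$,
      $L_{N_\sigma}(\hat{e}) = a$ ($a \notin Act_N$);

  • $Eot_{N_\sigma}(e_\sigma) = \sigma(e)$ for all
    $e_\sigma \in (E_{N_\sigma} \setminus \{\hat{e}\})$,
    $Eot_{N_\sigma}(\hat{e}) = 0$;

  • $Lot_{N_\sigma}(e_\sigma) = \sigma(e)$ for all
    $e_\sigma \in (E_{N_\sigma} \setminus \{\hat{e}\})$,
    $Lot_{N_\sigma}(\hat{e}) = 0$.

We call $T(N)$ the time-expansion of a process $(N, \varphi)$.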

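Proposition 3 Let $\mathcal{TN} = (\mathcal{N}, eft, lft)$ be a time Petri net,
$(N, \varphi)$ be a process of $\mathcal{N}$, and
$McM(\mathcal{N}) = (N_u, \varphi_u)$ be the McMillan-unfolding of
$\mathcal{N}$. Then

(i) $T(N)$ is a timed net process;

(ii) if $N$ is finite, then $T(N)$ is finite;

(iii) if $(m, \nu) \in RS(\mathcal{TN})$ and $t$ may fire after $\delta$ in
$(m, \nu)$ then there exist $(\pi, \tau) \in RS(T(N_u))$ and
$e_\sigma \in E_{T(N_u)}$ such that
$\varphi_u \{b \mid b_\sigma \in \pi^\bullet\} = m$, $\varphi_u(e) = t$, and
$e_\sigma$ may occur at $(\delta + TOE_\sigma({}^\bullet e, t))$ in
$(\pi, \tau)$, for some $\sigma \in VT(N_u, \varphi_u)$.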

We say that two time Petri nets are timed bisimilar iff the time expansions 
of their maximal processes are timed bisimilar. 

Given time Petri nets $\mathcal{TN} = (\mathcal{N}, eft, lft)$ and
$\mathcal{TN}' = (\mathcal{N}', eft', lft')$, a procedure for checking timed
bisimulation between $\mathcal{TN}$ and $\mathcal{TN}'$ consists of the
following steps: (1) constructing the McMillan-unfoldings of $\mathcal{N}$ and
$\mathcal{N}'$, $McM(\mathcal{N}) = (N_u, \varphi_u)$ and
$McM(\mathcal{N}') = (N'_u, \varphi'_u)$, respectively; (2) computing the sets
of valid timings of $(N_u, \varphi_u)$ and $(N'_u, \varphi'_u)$,
$VT(N_u, \varphi_u)$ and $VT(N'_u, \varphi'_u)$, respectively; (3) constructing
the timed net processes $T(N_u)$ and $T(N'_u)$, respectively; (4) checking timed
bisimulation between $T(N_u)$ and $T(N'_u)$.

We make some remarks on the complexity of the above procedure. The complexity
of the construction of the McMillan-unfolding of a safe Petri net is polynomial
in the net's size [9]. The complexity of the computation of the set of valid
timings is exponential in the size of a time Petri net, and hence the size of
the constructed timed net process is also exponential. The complexity of
checking for timed bisimulation between timed net processes is exponential in
their sizes (see the proof sketch of Theorem 7); however, for Petri nets with
discrete time it can be simplified significantly.

Appendix 

Proof Sketch of Theorem 4. We first construct $TN_X = (N, Eot, Lot)$ as
follows:

- $N_{TN_X} = \bigcup \big( N_{\pi_1 \times \pi_2} \mid \pi_i \in Proc(N_i),\ \exists \pi \in Proc(N) \bullet \pi = (\lambda_i, \mu_i) \cdot \pi_i\ (i = 1, 2) \big)$,
  where $N_{\pi_1 \times \pi_2} = (B, E, G, L)$ with

  • $E_{\pi_1 \times \pi_2} = \{(e_1, e_2) \mid e_i \in E_{\pi_i},\ \exists e \in E_\pi \bullet e = \mu_i(e_i)\ (i = 1, 2)\}$;

  • $B_{\pi_1 \times \pi_2} = \{(b_1, b_2) \mid b_i \in B_{\pi_i},\ \exists b \in B_\pi \bullet b = \lambda_i(b_i)\ (i = 1, 2)\}$;

  • $G_{\pi_1 \times \pi_2} = \{((b_1, b_2), (e_1, e_2)) \mid (b_i, e_i) \in G_{\pi_i} \text{ for some } i = 1, 2\} \cup \{((e_1, e_2), (b_1, b_2)) \mid (e_i, b_i) \in G_{\pi_i} \text{ for some } i = 1, 2\}$;

  • $L_{\pi_1 \times \pi_2}((e_1, e_2)) = L_{N_1}(e_1) = L_{N_2}(e_2)$;

- $Eot_{TN_X}((e_1, e_2)) = \max\{Eot_{TN_1}(e_1), Eot_{TN_2}(e_2)\}$;

- $Lot_{TN_X}((e_1, e_2)) = \min\{Lot_{TN_1}(e_1), Lot_{TN_2}(e_2)\}$.

Based on the construction, it can be shown that $TN_X$ is a timed net process.
Define mappings $(\lambda'_i, \mu'_i) : TN_X \to TN_i$ as follows:
$\lambda'_i((b_1, b_2)) = b_i$ and $\mu'_i((e_1, e_2)) = e_i$ ($i = 1, 2$).
According to the construction of $TN_X$, these mappings are morphisms. Moreover,
it holds that
$(\lambda_1, \mu_1) \circ (\lambda'_1, \mu'_1) = (\lambda_2, \mu_2) \circ (\lambda'_2, \mu'_2)$.
This implies that the diagram (see Theorem 4) commutes. Further, using the
construction of $TN_X$ and Theorem 3, it is straightforward to show that
$(\lambda'_i, \mu'_i) : TN_X \to TN_i$ is a TW-open morphism for $i = 1, 2$. □

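Proof Sketch of Theorem 6.

($\Rightarrow$) Let
$TN_1 \xleftarrow{(\lambda_1,\mu_1)} TN \xrightarrow{(\lambda_2,\mu_2)} TN_2$ be
a span of TW-open maps. Define a relation $\mathcal{B}$ as follows:
$\mathcal{B} = \{((\pi_1, \tau_1), (\pi_2, \tau_2)) \mid (\pi_i, \tau_i) \in RS(TN_i),\ \exists (\pi, \tau) \in RS(TN) \bullet (\lambda_i, \mu_i) \cdot \pi = \pi_i,\ \tau|_{E_i} = \tau_i\ (i = 1, 2)\}$.
We then have $((\pi_{N_1}, 0), (\pi_{N_2}, 0)) \in \mathcal{B}$. Since
$(\lambda_1, \mu_1)$ and $(\lambda_2, \mu_2)$ are TW-open morphisms, it is
straightforward to show that $\mathcal{B}$ is a timed bisimulation, using
Theorem 3.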

($\Leftarrow$) Assume $\mathcal{B}$ is a timed bisimulation between $TN_1$ and
$TN_2$. We first construct $TN = (N, Eot, Lot)$ as follows:

- $N_{TN} = (B, E, G, L)$ with

  • $E_{N_{TN}} = \{(e_1, e_2) \mid (\pi_i, \tau_i) \xrightarrow{(e_i,d)} (\pi'_i, \tau'_i) \text{ in } TN_i\ (i = 1, 2),\ ((\pi_1, \tau_1), (\pi_2, \tau_2)),\ ((\pi'_1, \tau'_1), (\pi'_2, \tau'_2)) \in \mathcal{B},\ L_{N_1}(e_1) = L_{N_2}(e_2)\}$;

  • $B_{N_{TN}} = \{(b_1, b_2) \mid \exists (e_1, e_2) \in E_{N_{TN}} \bullet b_i \in {}^\bullet e_i \vee b_i \in e_i^\bullet\ (i = 1, 2)\}$;

  • $G_{N_{TN}} = \{((b_1, b_2), (e_1, e_2)) \mid b_i \in {}^\bullet e_i,\ (b_i, e_i) \in G_{N_i} \text{ for some } i = 1, 2\} \cup \{((e_1, e_2), (b_1, b_2)) \mid b_i \in e_i^\bullet,\ (e_i, b_i) \in G_{N_i} \text{ for some } i = 1, 2\}$;

  • $L_{N_{TN}}((e_1, e_2)) = L_{N_1}(e_1) = L_{N_2}(e_2)$;

- $Eot_{TN}((e_1, e_2)) = \max\{Eot_{TN_1}(e_1), Eot_{TN_2}(e_2)\}$;

- $Lot_{TN}((e_1, e_2)) = \min\{Lot_{TN_1}(e_1), Lot_{TN_2}(e_2)\}$.

Based on the construction it can be shown that $TN$ is a timed net process.
Define mappings $(\lambda_i, \mu_i) : TN \to TN_i$ as
$\lambda_i((b_1, b_2)) = b_i$ and $\mu_i((e_1, e_2)) = e_i$ ($i = 1, 2$). From
the construction of $TN$, it follows that these mappings are morphisms. Further,
using the definition of timed bisimulation, the construction of $TN$, and
Theorem 3, it is straightforward to show that
$(\lambda_i, \mu_i) : TN \to TN_i$ is a TW-open morphism for $i = 1, 2$. □

Abstract. The paper presents a formulation of the problem of mapping parallel
programs onto heterogeneous networks and proposes a distributed recursive
(heuristic) algorithm for its solution. This algorithm does not require global
knowledge of the computational state; it uses only information obtained from
neighbouring nodes. Parallel programs and networks are represented as weighted
graphs. At each stage a graph bisection strategy is used to divide all processes
of the program into two groups, which may then be sent to neighbour nodes, where
the algorithm continues in the same way, or be left on the initial node.



1. Introduction 

The task is to distribute the processes of a parallel program over a
heterogeneous network so as to minimize the execution time of this program. This
would allow the same programs to be adapted to different network topologies,
increasing program effectiveness and decreasing communication costs.



2. Models 

To solve the mapping problem it is necessary to examine the structures of
different parallel programs. We should therefore choose a model by which many
parallel programs can be represented, i.e. it must be abstract enough.
Abstracting from particular network details enhances a model's architecture
independence. This enables algorithms and software to be portable across several
types of parallel systems. The graph model provides complete abstraction from
the explicit expression of parallelism and from the details of communication and
synchronization.



2.1. Graph Model of the Program 

A program graph is a directed graph $G = (V, E)$ with

• $V = \{(v_i, cv_i)\}$ - a set of vertices $v_i$ with complexity $cv_i$,

• $E = \{(e_{ij}, ie_{ij})\}$ - a set of arcs $e_{ij} = (v_i, v_j)$ with
communication cost $ie_{ij}$.

Let the complexity of a process be the number of instructions the process
executes from its initiation to its completion. Let the communication cost from
process $v_i$ to $v_j$ be the number of bytes that process $v_i$ sends to $v_j$
during one execution of the program. Then the communication cost $ie_{ij}$
between processes $v_i$ and $v_j$ is the sum of the communication costs
$v_i \to v_j$ and $v_j \to v_i$.
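As a small illustration of how the arc weights could be derived, the sketch below sums the byte counts of both directions for every pair of processes. The input format (a dictionary keyed by ordered process pairs), the function name and the sample numbers are assumptions made for this example only.

```python
def communication_costs(bytes_sent):
    """Combine directed byte counts into one cost per unordered process pair.

    bytes_sent maps (sender, receiver) to the number of bytes sent during
    one execution of the program.
    """
    costs = {}
    for (src, dst), n_bytes in bytes_sent.items():
        pair = frozenset((src, dst))            # direction is ignored
        costs[pair] = costs.get(pair, 0) + n_bytes
    return costs


# Hypothetical measurements for three processes.
measured = {("v1", "v2"): 100, ("v2", "v1"): 40, ("v1", "v3"): 7}
weights = communication_costs(measured)
```

With these numbers the arc between `v1` and `v2` gets the weight 140 and the arc between `v1` and `v3` the weight 7.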

2.2. Graph Model of the Network 

The network is modelled by a graph $S = (W, Q)$, where $W = \{(w_i, f_i)\}$ is a
set of vertices (nodes, computers) with weights $f_i$, and
$Q = \{(q_i, g_i)\}$ is a set of arcs (channels) with weights $g_i$. As for the
program graph, it is necessary to estimate computer capacities and channel
speeds in order to assign weights to the vertices and arcs.

2.2.1. Computer Capacity Estimate 

Computer capacity is measured in units of time: a computer that carries out the
same volume of work in less time is faster. The execution time of any program is
measured in seconds. Productivity is frequently measured as the rate of
occurrence of some number of events per second, so a smaller time means a higher
productivity.

The central processor time for a program can be expressed in two ways: as the
number of clock cycles for the given program multiplied by the clock cycle
duration, or as the number of clock cycles for the given program divided by the
clock frequency. An important characteristic of a processor is the average
number of clock cycles per instruction (CPI). Given the number of instructions
executed in the program, this parameter allows the central processor time for
the given program to be estimated.

Thus, the productivity of the central processor depends on three parameters: the
clock frequency, the average number of clock cycles per instruction, and the
number of instructions executed. When two computers are compared, all three
components must be considered to obtain their relative capacity.

Thus, the execution time of process $v_j$ on processor $w_i$ is

$$\ldots \qquad (1)$$




But the task executing on the computer occupies processor and resources. 
Therefore computer capacity for other tasks becomes lower. Then execution time of 
process Vj on processor q^ is 



T; = + U , where u= / TV, - execution time of previous tasks. 



( 2 ) 
2.2.2. Channel Speed Estimates

The information transfer speed in a network depends on the latency and throughput of
its channels.

Latency is the transfer time of an empty message. This quantity can be used to measure
software delays (the time needed to access the network on the sender and receiver
sides).

Throughput is an upper limit on the volume of traffic that the sender can transfer to
the receiver per unit of time. When measuring this characteristic, possible
competition among several flows for throughput should be taken into account.

According to the definitions introduced above, the transfer time of a message of
length bits over a direct connection from one computer to another is defined as

$t = \mathit{latency} + \mathit{length} / \mathit{throughput}$    (3)

Network performance degrades under overload (when many messages are in the network
simultaneously). How the overload affects the characteristics introduced above
depends on the technology of the particular network.
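
The transfer-time estimate above (latency plus message length divided by throughput) can be sketched in a few lines of Python. The function name, the parameter names, and the numbers in the example are illustrative assumptions and do not come from the paper; the sketch also ignores overload effects.

```python
def transfer_time(length_bits: float, latency_s: float, throughput_bps: float) -> float:
    """Estimate the time to send one message over a direct connection.

    Implements t = latency + length / throughput for an unloaded channel.
    """
    if throughput_bps <= 0:
        raise ValueError("throughput must be positive")
    return latency_s + length_bits / throughput_bps


# Hypothetical channel: 1 ms latency, 100 Mbit/s throughput, 1 Mbit message.
example = transfer_time(1_000_000, 0.001, 100_000_000)
# An empty message costs exactly the latency.
empty = transfer_time(0, 0.001, 100_000_000)
```

For short messages the latency term dominates, which is why the latency of each channel matters as much as its throughput when the arcs are weighted.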



3. Execution Time Optimization 

Execution time of a parallel program is the time that elapses from when the first 
processor starts executing on the problem to when the last processor completes 
execution. 

During execution, each processor is computing, communicating, or idling. Hence, the
total execution time $T$ can be defined as the sum of the computation, communication,
and idle times of all $p$ processors:

$T = \sum_{i=1}^{p} \left( T^i_{comp} + T^i_{comm} + T^i_{idle} \right)$,    (4)

where $T^i_{comp}$, $T^i_{comm}$, and $T^i_{idle}$ are the times spent computing,
communicating, and idling, respectively, on the $i$th processor.
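
As a minimal illustration of this definition, the following Python sketch sums the computation, communication, and idle times over all processors. The `ProcessorTimes` record and the sample values are hypothetical and serve only to make the bookkeeping explicit.

```python
from typing import NamedTuple, Sequence


class ProcessorTimes(NamedTuple):
    """Per-processor time components, in seconds (hypothetical record)."""
    comp: float  # time spent computing
    comm: float  # time spent communicating
    idle: float  # time spent idling


def total_execution_time(times: Sequence[ProcessorTimes]) -> float:
    """Sum computation, communication, and idle times over all processors."""
    return sum(t.comp + t.comm + t.idle for t in times)


# Two processors with made-up measurements.
sample = [ProcessorTimes(3.0, 1.0, 0.5), ProcessorTimes(2.5, 1.5, 0.5)]
total = total_execution_time(sample)
```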

Taking into account graph model of a program and a network: 

. ^ ie ^ 

= S cv,j *fi , t;' = E ^ ’ 

j i: V SI L 

where process $v_{i_j}$ is placed on vertex $w_i$ for each $j$ (processes with numbers
$j_1 \ne j_2$ are placed on the same vertex of the network).

Both computation and communication times are specified explicitly in a parallel
algorithm; hence, it is generally straightforward to determine their contribution to
execution time. Idle time can be more difficult to determine, however, since it often
depends on the order in which operations are performed.
So we can define two strategies for placing processes on processors.

1. We place different tasks on different processors to decrease computation time.

2. We place tasks that communicate frequently on the same processor to decrease
communication time.

Clearly, these two strategies will sometimes conflict, in which case our design will 
involve tradeoffs. In addition, resource limitations may restrict the number of tasks 
that can be placed on a single processor. 

The mapping problem is known to be NP-complete. It should therefore be solved
approximately or with heuristic knowledge about the program and the network.



3.1. Recursive Graph Partitioning 

In recursive bisection, we partition a domain (e.g., a finite element grid) into sub- 
domains of approximately equal computational cost while attempting to minimize 
communication costs, that is, the number of channels crossing task boundaries. 

HcVi.i ^ 2 , where (v- ^ ,CV - ^ )eG,,k = 1,2 ( 6 ) 

i 3 

min , where e G ( 7 ) 

n 

The domain is first cut in one dimension to yield two sub-domains. Cuts are then 
made recursively in the new sub-domains until we have as many sub-domains as we 
require tasks. Notice that this recursive strategy allows the partitioning algorithm itself 
to be executed in parallel. 

Partitioning into two approximately equal parts is suited to parallel computers in
which all processors have the same productivity and the channels are fast. In
networks it is expedient to emphasize minimizing communication between sub-domains
because of the low communication speed. To solve this problem we should find a
minimum cut of the graph G, but this task has very high complexity, so an exact
solution cannot be used here.

We will use connectivity information to reduce the number of edges crossing
sub-domain boundaries, and hence to reduce communication requirements. First we
identify the two extremities of the graph, $v_1$ and $v_2$, that is, the two vertices
that are the most separated in terms of graph distance. (The graph distance between
two vertices is the smallest number of edges that must be traversed to go between
them.) The extremities are included in $G_1$ and $G_2$, respectively. Then, for each
of the remaining vertices, we calculate its distances to the extremities ($r_1$ and
$r_2$). If $r_2 > r_1$, the vertex is included in $G_1$; if $r_2 < r_1$, it is
included in $G_2$. If $r_1 = r_2$, the decision about where to place the vertex is
made on the basis of heuristic knowledge (the vertex is included in the smaller graph
or in the graph with more incident vertices).

The complexity of this algorithm is O(N^) in the worst case. The exactness of the 
solution depends on initial graph class. 
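
The bisection step described above can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: it assumes a connected, undirected graph with at least two vertices given as an adjacency dictionary, finds the extremities by running a breadth-first search from every vertex, and resolves ties by putting the vertex into the currently smaller part (one of the two heuristics mentioned in the text). All function names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Graph distance (number of edges) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def find_extremities(adj):
    """Return a pair of vertices at maximum graph distance (BFS from each vertex)."""
    best_distance, best_pair = -1, (None, None)
    for u in adj:
        dist = bfs_distances(adj, u)
        far = max(dist, key=dist.get)
        if dist[far] > best_distance:
            best_distance, best_pair = dist[far], (u, far)
    return best_pair


def bisect(adj):
    """Split the vertices into two parts grown around the two extremities."""
    v1, v2 = find_extremities(adj)
    d1, d2 = bfs_distances(adj, v1), bfs_distances(adj, v2)
    g1, g2 = {v1}, {v2}
    for v in adj:
        if v in (v1, v2):
            continue
        r1, r2 = d1[v], d2[v]
        if r1 < r2:
            g1.add(v)          # closer to the first extremity
        elif r2 < r1:
            g2.add(v)          # closer to the second extremity
        else:
            # Tie: assumed heuristic, choose the currently smaller part.
            (g1 if len(g1) <= len(g2) else g2).add(v)
    return g1, g2


# A path 0 - 1 - 2 - 3: the extremities are the two end vertices.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
parts = bisect(path)
```

Applying `bisect` again to the subgraph induced by each part gives the recursive partitioning; the two recursive calls are independent, which is what allows the partitioning itself to run in parallel.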
3.2. Optimizing Algorithm



The mapping algorithm is developed within the hierarchical manager/worker model. At
the first step two graphs are given: the program graph $G_0$ and the network graph
$S$. It is necessary to map $G_0 \to S$ with minimum execution time.

1. Let all processes work on one computer. The program execution time estimate is
$T_0 = \sum_i cv_i \cdot f$. This is a lower bound of $T$, since process interaction
is not taken into account (one process can suspend execution until it receives data
from another process, which increases the overall execution time).

2. Divide $G_0$ into two graphs $G_0'$ and $G_0''$ using recursive graph partitioning
and calculate the program execution time on two nodes $w_i$ and $w_j$ connected by an arc:



fj = max| I latency -\ — -* M | . 

§ 



( 8 ) 



where $M$ is the number of arcs between $G_0'$ and $G_0''$.

Since the network graph can be rather large, it would not be efficient to examine
all pairs of computers on which the resulting groups of processes could be executed.
We therefore limit ourselves to the neighbours of the current computer, i.e. those
having a direct connection with it. This strongly reduces the complexity of the
algorithm, though it also reduces accuracy. Let $T_1$ be the minimum among all the
nodes connected to the initiator.

3. If $T_1 < T_0$ then all processes will be placed on two computers.



M 



If ^0 S 1 + 



k=\ 



, then dividing program in two groups, we 



will reduce the whole program execution time T, so all processes will be placed on 
two computers. Go to step 5. 

4. The program will be executed on the initiator. End.

5. Continue the algorithm for nodes $w_i$ and $w_j$.



At the second step we divide the initial graph $G_0$, obtaining two new graphs $G_0'$
and $G_0''$. After that, the arcs connecting $G_0'$ and $G_0''$ are lost to these
graphs. After the first division they are still available through the graph $G_0$,
but later groups of processes will not have access to the initiator. Sending $G_0$
together with the new graphs is not efficient because $G_0' \subset G_0$. It is more
convenient to obtain $G_0'$ from $G_0$ by indicating in each vertex which graph,
$G_0'$ or $G_0''$, it belongs to (before sending $G_0'$ to processor $w_i$ we write
the processor identifier on the vertices of $G_0$). Therefore, at each step we know
where all groups of processes are situated. Such a program graph, in whose vertices
it is recorded where each vertex is situated, will be referred to as a mapping graph.

Let us demonstrate how this algorithm works on a simple example. The initial graph is
$G_0$ (a homogeneous grid). The network consists of six nodes, where node 5, which is
more powerful than the others, is the initiator (Fig. 1).






a) program graph $G_0$

Fig. 1. Program and network graphs



At the first step we divide the vertices of $G_0$ into two groups, mark them, and send
one copy of the marked graph to node 1 (Fig. 2). This means that executing the
processes on two nodes (1 and 5) will take less time than executing them on one node.



Fig. 2. First step of the mapping algorithm (a: copy of the mapping graph in node 5;
b: copy of the mapping graph in node 1)

At the second step node 1 divides its graph, re-marks it, and sends a copy of the
newly marked graph to node 6. As node 5 is more powerful, it keeps all processes with
mark “5” for execution on itself. At this step there are two different versions of
the mapping graph in the network (Fig. 3).

The existence of different copies reduces the accuracy of the algorithm. However, the
error will be small because processes with direct interconnections are placed on the
same or on close nodes.

The resulting distribution obtained by the given algorithm and the exact solution
obtained by exhaustive search are shown in Fig. 4.

As one can see, the difference between the exact and heuristic solutions is one arc.
Fig. 3. Graph copies on the last step (a: copy of the mapping graph in node 5;
b: copy of the mapping graph in nodes 1 and 6)

Fig. 4. Resulting distribution (left) and exact solution (right).



Conclusions and Future Work 

The distributed heuristic algorithm presented in this paper uses the classical
representation of parallel programs and networks as weighted graphs and the
well-known graph bisection strategy (which can, however, be replaced by another one).
An implementation of this algorithm should use a more complex representation of
networks (for example, one that takes into account the features of bus networks or
stars with hubs). To prevent deadlocks, the resulting program can be organized as a
hierarchical manager/worker, where each level of the hierarchy has its own number
mark. A node with a larger mark cannot be the manager of a node with a smaller mark.

Future work will be the implementation of the presented algorithm and a detailed
investigation of its behaviour for different applications and networks.
Abstract. In the present work, fine-grain parallel applications of a wide class to
be executed on distributed-memory multiprocessors are investigated. The prob- 
lem of determining the maximum degree of parallelism in the control flow of 
such applications is under consideration. An efficient procedure (called R- 
procedure) for solving the indicated problem is introduced and stated. The pro- 
cedure operates on a control flow graph of an application, and it provides fast 
calculation of the parallelism degree through the technique of reduction of the 
control flow graph. The R-procedure is shown to have the time complexity of at 

most O , with n and z standing for the number of processes in the appli- 

cation and parallelism degree, respectively. 



1 Introduction and Characteristic of the Problem 



In the present work, we investigate fine-grain parallel applications of a wide class, the 
applications to be executed on distributed-memory multiprocessors (DMMP) [1], [2]. 
One of the important problems that arise in treating this type of applications is to 
determine the maximum degree of parallelism in their control flow. If we were able to 
calculate this degree, we could, for example, get information on how many processors 
must be involved within a DMMP to optimally implement the application in such a 
fashion that no pair of mutually parallel modules (processes) are assigned to the same 
processor. Calculating the parallelism degree for an application, in general, requires 
exhaustive search among different combinations of processes within the application 
resulting in exponential time complexity. Thus, we need to take a different approach 
to reduce the overhead. 

We introduce a procedure for efficiently solving the indicated problem. Our proce- 
dure provides fast calculation of the parallelism degree through the technique of re- 
duction of a control flow graph of an application. We show the procedure to have the 



time complexity of at most O 



with n and z standing for the number of proc- 



esses in the application and parallelism degree, respectively. 
2 Description of the Procedure

To represent an application to be analyzed, we introduce a
single-entry-single-termination control flow graph $G$ [3]. $G$ is a directed graph
with a set of vertices $V$ and a set of arcs $E \subseteq V \times V$. Each vertex
$a_i \in V$ corresponds to a particular process in the application, and an arc
$e_{ij} = (a_i, a_j) \in E$ shows $a_j$ to be a direct successor of $a_i$ in the
application's control flow. Graph $G$ contains a vertex $a_0$ that has no incoming
arcs and a vertex $a_t$ that has no outgoing ones; $a_0$ is called the initial vertex
(the entry point) of $G$, at which the application comes into execution, and $a_t$ is
said to be the termination vertex (the termination point), at which the application
terminates.

Remark 1. Hereinafter, we use the terms "vertex" and "process" interchangeably. 

Remark 2. To help the following analysis, we ignore data flows between processes. 

Remark 3. We consider general-case control flow graphs in which any vertex may in
turn stand for a complex subgraph. For example, a process denoted as $a_i$ in $G$ may
have a number of parallel threads whose precedence can be depicted by a separate
control flow graph.

Parallel applications of the considered class are known to have conditional and
parallel branching points [4]. To incorporate these points in our application model,
we introduce two additional sets of vertices, $V^{\oplus}$ and $V^{\&}$. A vertex
$a_i \in V^{\oplus}$ represents a conditional branching point, and a vertex
$a_j \in V^{\&}$ stands for a parallel branching (barrier) point. In the following, we
assume that $V^{\oplus} \cup V^{\&} \subset V$.

Remark 4. Each vertex $a_i \in V^{\oplus}$ is a predecessor of at least two other
vertices $a_j, a_k \in V$ (i.e., $(a_i, a_j), (a_i, a_k) \in E$) and can transfer
control to either $a_j$ or $a_k$.

Remark 5. Each vertex $a_i \in V^{\&}$ is a direct predecessor of at least two
vertices $a_j, a_k \in V$ (i.e., $(a_i, a_j), (a_i, a_k) \in E$) and/or a direct
successor of at least two vertices $a_l, a_m \in V$ (i.e.,
$(a_l, a_i), (a_m, a_i) \in E$).

Remark 6. Hereinafter, we suppose that the initial graph $G$ contains no cyclic paths
(loops); otherwise we transform $G$ by eliminating some arcs with the following
procedure. We trace any path emanating from the initial vertex until it reaches
either the termination vertex $a_t$ or a vertex $a_k$ that has already been passed.
If we have found such a vertex $a_k$, we eliminate the incoming arc that has led us
into $a_k$. In the same fashion, we eliminate all similar arcs. We denote the graph
obtained through this transformation as $G^*$. Let $E^*$ be the set of arcs in $G^*$.
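
One way to read the loop-elimination procedure of Remark 6 is as a depth-first traversal that drops every arc leading back into a vertex on the path currently being traced. The Python sketch below follows that reading; treating "already passed" as "on the current path" is an assumption made here, and the function name and the vertex labels are hypothetical. The sketch also assumes that every vertex is reachable from the initial vertex, as in a single-entry graph.

```python
def eliminate_loops(succ, start):
    """Return a copy of the successor map with loop-closing arcs removed.

    succ maps each vertex to the list of its direct successors.
    """
    acyclic = {v: [] for v in succ}
    state = {}  # vertex -> "open" (on the current path) or "done"

    def visit(u):
        state[u] = "open"
        for v in succ[u]:
            if state.get(v) == "open":
                continue  # arc (u, v) leads back into the current path: drop it
            acyclic[u].append(v)
            if v not in state:
                visit(v)
        state[u] = "done"

    visit(start)
    return acyclic


# a0 -> a1 -> a2 -> at, with an extra arc a2 -> a1 that closes a loop.
looped = {"a0": ["a1"], "a1": ["a2"], "a2": ["a1", "at"], "at": []}
loop_free = eliminate_loops(looped, "a0")
```

The recursion is adequate for small graphs; an explicit stack would be needed for very deep ones.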

Remark 7. The described procedure gives the initial graph several hypothetical
termination points (the first is $a_t$ and the others are vertices deprived of their
outgoing arcs by the above transformation). To avoid these extra termination points,
we may introduce additional arcs in such a way that any vertices parallel in $G$ are
still parallel in $G^*$.

To give our procedure a formal representation, let us define the (indirect) successor
$\sigma$, disjunction $\delta$, and parallelism $\pi$ relations on the set $V$.
Definition 1. A vertex $a_i$ is said to be a successor of a vertex $a_j$, i.e.,
$a_i \sigma a_j$, if there exists a path connecting $a_j$ to $a_i$ in $G^*$.

Definition 2. Two vertices $a_i$ and $a_j$ are said to be in the disjunction relation,
i.e., $a_i \delta a_j$, if they are covered by two alternative paths emanating from a
vertex $a_k \in V^{\oplus}$ in $G^*$.

Definition 3. Vertices $a_i$ and $a_j$ are said to be in the parallelism relation,
i.e., $a_i \pi a_j$, if not $a_i \delta a_j$ and not $a_i \sigma a_j$ and not
$a_j \sigma a_i$.

Relation $\pi$ can be represented by a graph $\Pi = (V, \pi)$ whose vertices
correspond to those of $G$, and any two vertices $a_i$ and $a_j$ are coupled by an
edge iff $a_i \pi a_j$. It is convenient to refer to graph $\Pi$ as the parallelism
relation graph.

In terms of the relations defined above, we can reformulate the initial problem as
searching for a clique within the parallelism relation graph $\Pi$. To solve such a
problem, known methods for finding a clique in a graph can be employed. In this case,
however, the graph $\Pi$ must be supplied as an input to these methods (i.e., the
initial application must first be transformed into $\Pi$). We take another approach
that does not need to construct the graph $\Pi$ directly and finds a clique for $\Pi$
implicitly through graceful reduction of the graph $G^*$.
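
For comparison, the direct approach that the reduction procedure is meant to avoid can be sketched as follows: build the parallelism relation explicitly from reachability in an acyclic control flow graph, then search exhaustively for the largest clique. This Python sketch is a baseline illustration only, not the reduction procedure of this paper. It assumes that the disjunction relation is supplied by the caller as a set of unordered pairs, and its clique search takes exponential time, so it is usable only for very small graphs. All names are hypothetical.

```python
from itertools import combinations


def reachable(succ, start):
    """All direct and indirect successors of start in an acyclic graph."""
    seen, stack = set(), [start]
    while stack:
        u = stack.pop()
        for v in succ[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def parallelism_edges(succ, disjoint_pairs=frozenset()):
    """Pairs of vertices that are neither successors of each other nor alternatives."""
    reach = {v: reachable(succ, v) for v in succ}
    edges = set()
    for a, b in combinations(succ, 2):
        related = b in reach[a] or a in reach[b]
        alternative = frozenset((a, b)) in disjoint_pairs
        if not related and not alternative:
            edges.add(frozenset((a, b)))
    return edges


def parallelism_degree(succ, disjoint_pairs=frozenset()):
    """Size of the largest clique in the parallelism relation graph (brute force)."""
    edges = parallelism_edges(succ, disjoint_pairs)
    vertices = list(succ)
    for size in range(len(vertices), 1, -1):
        for group in combinations(vertices, size):
            if all(frozenset(pair) in edges for pair in combinations(group, 2)):
                return size
    return 1 if vertices else 0


# A parallel fork from a0 into a1, a2, a3 that joins again at "at".
fork = {"a0": ["a1", "a2", "a3"], "a1": ["at"], "a2": ["at"], "a3": ["at"], "at": []}
degree = parallelism_degree(fork)
```

If `a1` and `a2` are instead declared alternatives of a conditional branch, by passing `{frozenset(("a1", "a2"))}` as `disjoint_pairs`, the largest group of mutually parallel vertices shrinks to two.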

To state the reduction process more formally, we specify the transformed graph $G^*$
by a system of constructive expressions $\Sigma = \{S_i\}$, $i = \overline{1, Q}$, of
the following form:

$S_i = (i : R_1^i \to R_2^i)$, $R_1^i, R_2^i \subset V$, $i = \overline{1, Q}$,

where $R_1^i$ and $R_2^i$ are sets of vertices such that
$R_1^i \cap R_2^i = \emptyset$, $(a_j, a_k) \in E^*$ for all $a_j \in R_1^i$,
$a_k \in R_2^i$, and both $R_1^i$ and $R_2^i$ are maximal by inclusion.

It is evident that no two vertices of either $R_1^i$ or $R_2^i$ can be each other's
direct or indirect successors. Thus, the sets $R_1^i$, $R_2^i$ can be represented in
the following generalized constructive (parenthesis) form:



R\ = a„ 






R\=i 







) . 


( 1 ) 


Vi / 


• [x[Rf\x[Rf\. 




( 2 ) 



where Rf, = \,h, t' — \,p^ ) is a subset of vertices ( Rf c R\ ) recursively defined in 
form (1); Rf, (g = 1, r, f = 1, ) is a subset of vertices ( Rf ^Rf) recursively defined 

in form (2); xf is a logic condition that controls the transition from the vertices of /?,' 
to the vertices of Rf ; and " | " are constructive separators reflecting the parallel- 
ism and disjunction relations, respectively. 
Figure 1 illustrates the construction of a system $\Sigma$ for a given control flow
graph $G$ (in the figure, the symbols & and ⊕ denote vertices $a_i \in V^{\&}$ and
$a_j \in V^{\oplus}$, respectively).




5'i = (l:ao^a,), 

*53= (2 

53 = (3 : fij • fi3 ^ ), 

= (4 : ^ 

^3=(5 : —> xa^ \ xag ), 

iS'j = (6 . Gg I Gj — > G[(J 
={7:a^^ Oj), 

iSg = (8 : G, • Gio ^ Gji), 

5p=(9:Gii^g.) 



Fig. 1. Representation of a control flow graph by a system of constructive expressions
(a - graph $G$; b - system $\Sigma$ specifying graph $G$)

We introduce the relations of strict and non-strict constructive inclusion ([⊂] and
[⊆]) possessing the properties of strict and non-strict inclusion (⊂ and ⊆),
respectively.

Definition 4. Let R- and R- be subsets of vertices represented in form (1) (or (2)). 
We assume that 

K ^ ^ R, ) A (3#i c S = 7?.^ ) , 

where is a subset of vertices enclosed by a parenthesis in R- , i.e., what we call "a 
constructive subset" of R . 

h 

Using the constructive inclusion relations gives us a formal and straightforward 
way to specify the hierarchy of control constructs in a control flow graph. 
Let $R$ be a set specified in form (1) (or (2)). $R$ can include a certain number of
mutually alternative and/or parallel vertices, taking into account those enclosed in
parentheses. To express the maximum number of mutually parallel vertices that
constitute $R$, we introduce a parameter referred to as the "π-cardinality" of $R$.

Now let us define the operators (rules) of the proposed reduction procedure. We
will specify these operators in the following form: if $\psi$ then fulfill
$\varphi$, where $\psi$ and $\varphi$ are a condition and a transformation,
respectively.

f-absorption rule. If 3S-,S^e'E., 7?* a,iR^, -|^f| ’ 

we construct a new reduced subsystem of expressions , 

where 5° = {i\ R[ R‘ 2 ) ^ ^2 — (-^2 ^ subst {^R^ ,/?*); subst (/?* , ) implies 

replacing /?* with /?,* (we call this action a substitution, or an absorption); " > " 
should be interpreted as if /?* • \Rj ) , and as " | ", if /?* I (/Jj \7?f ) , and as 0, if 

= 0. 

sl-absorption rule. If 3S-,S^ e 'E., i ^ k\ R[, ^|f^ 2 | , then we 

construct a reduced subsystem =[S\{5,-,5'j.}]u|5'°}, where S° = {Jc.R'1 R^^ , 

=(< \ 77' )> subst (/?;,/?'), if 77' *(77f \7?'), ">"="|", if 

77' l(77f \7?‘), ">" = 0,if 7?,*\77‘ =0. 

Remark 8. The effect of T -absorption rule may be visualized as eliminating a subset 
of vertices ( 77, * ) from the graph and reconnecting all their incoming arcs to their di- 
rect successors ( 7?* )■ This elimination is subject to the number of parallel vertices in 

T ?2 exceeds or is equal to that in 7?j , i.e., 7?^ - • The corresponding illustra- 

tion is given in Fig. 2. The effect of sl-absorption rule is about the same, and also can 
be easily understood through exploring the example in Fig.2. 

The rules defined above allow us to formulate the reduction procedure (referred to
below as the R-procedure) as follows.

1. Choose a pair of expressions $S_i, S_k \in \Sigma$ satisfying the conditional part
$\psi$ of the ↓-absorption or ↑-absorption rule. If no such pair is found, then
terminate.

2. Transform the initial system $\Sigma$ into a reduced system $\Sigma''$ according to
the chosen rule.

3. Let $\Sigma = \Sigma''$ and go to step 1.

The procedure continues until a non-reducible system $\Sigma'$ specifying a
hypothetical graph is obtained. Our thorough investigation has shown that the
following theorem holds.

Theorem. Any system E' resulting in the 77-procedure always contains a pair of ex- 
pressions 5, =(l:(3o^^)> Sq, = {Q' '.Q,^ Ut) , with Q, specifying a clique to be 
found; the .^-cardinality of Q. yields the parallelism degree for the given graph G and
the application analyzed ( z = ). 




Fig. 2. Effect of the ↑-absorption rule (a - a segment of a graph, possibly of a
partially transformed graph $G$; b - the result of applying the ↑-absorption rule to
the given segment)

Before discussing the proof of the theorem, we must mention that these results were
found to be valid for graphs that include no mutually parallel communicating cyclic
and/or branching segments. For a graph with such constructs, the reduction process
may stall before it actually yields a system $\Sigma'$. This effect was found to be a
consequence of an ambiguity in the representation of the considered constructs by
constructive expressions. This phenomenon is illustrated in Fig. 3. However, adding an
extra rule (referred to as "regrouping") to our procedure allows us to overcome this
problem. The new rule is stated as follows.
If 3S^,S^ e S, i ^ k\ Rf c R‘ 2 , and R ‘2 includes a constructive subset of the
following form 

R o I... ]•... • I ... , 

where a^J czV is a subset of vertices, Rf - //i,//, , " ° "e {"•"/' | "}^ with 

R\ = [all • • • • • «.? ) I (< • • • • • «^’ ) I • • • I I".": • • • • • ) ’ 

c = 1, >v , then we presume 

S"=[E;\{5,.}]u{ 5,}, where 5,. = (i: 7?; ^ ) > ^ o[/?* If?-]), with 



7?-=( 


1 ... Iflj'hi la^'+i l<+i !■ 


... 1 ... 


l<'-i 


l...la;^' )• 








-1 l«^+i 








..laZ 







interpreted as "•", if , and as " | ", if . 

Now let us briefly discuss the proof of our theorem. Because of space constraints, we
provide only a sketch of the proof.¹ In proving the theorem, we first show that the
reduction process according to the above absorption rules can always start, whatever
system $\Sigma$ is to be transformed. Second, we specify the general form of the
non-reducible system for any initial system of expressions. Third, by induction, we
find that both rules produce a reducible system unless a non-reducible system
$\Sigma'$ is obtained. And finally, by contradiction, we show that the π-cardinality
for the system $\Sigma'$ cannot be less than the cardinality of a clique in the
parallelism relation graph.



3 On the Efficiency of the Procedure 



Let us evaluate the time complexity $C$ of the R-procedure. It is clear that its
laboriousness $T$ essentially depends on step 1 and step 2; however, step 2 requires
constant time to obtain a reduced system $\Sigma''$. Therefore, we can state the
following



T = 







(3) 



where $|\Sigma_p|$ is the cardinality of the system $\Sigma$ at the $p$th step (note
that $|\Sigma_1| = |\Sigma| = Q$); $q$ is the total number of steps in the reduction
process; f is the maximal laboriousness of a rule in our procedure.



¹ To see a complete and detailed version of the proof, please send a request to
zotov @kursknet.m 
Fig. 3. Ambiguity in the representation of parallel conditional branching in control
flow graphs (a - a segment of graph $G$; b - first way to specify the segment;
c - second way to specify the segment)

The T-absorption and Nl-absorption rules decrement the value |h| , and the regroup- 
ing rule has no effect on |b| , therefore q = |h| + ^— 2 , where is the number of re- 
groupings to be performed on the system being reduced, Each vertex of the 

initial graph G can have at most z = outgoing arcs to other mutually parallel 
vertices (otherwise we were to state Q, could not specify a clique to be found), hence 
I I d(n—\) d{n—\) 

5 and q -1-^—2, \<d<z. Denote x the total number of re- 

z z 

groupings followed by a p\h reducing step. Then |Hp| = |H| — (/? — l)-l-^p ~ 
din—\) , Xn 

— t L — — (in general, — ^ □ 1 , with x weakly depending upon the total 

z P " 

number of steps; consequently, we may assume that Xp^ ~ Zp^ =/T ’ Pi^ Pi )■ Hav- 
ing substituted the expressions for |Sp| and q into (3), and having performed some 
evident algebraic transformations, we shall obtain 
d



V xd"n^ dn[f -2x + e-2y^) 



where X ~ 



3z' z" z 

X', if the initial graph contains at least 

one parallel cyclic or branching segment, x' > 1; 
1 otherwise. 



It is evident that ^ = 0(1), xd^ =0(l), d{^X^ -2x + E-2y^ = 0{\), there- 



fore, the time complexity of the proposed procedure will be C = O 
Assuming n □ z , we finally attain C = 0 



— I 1 + - + — 

n n 



Z^J 



Thus, the suggested procedure requires no more than O 



Z^J 



steps to find a 



clique of a given parallelism relation graph. The cardinality z of the clique provides 
the maximum degree of parallelism in the initial application. 
Abstract. We propose a new algorithm (ARTCP) for transport protocols of
packet switching networks.

ARTCP uses a scheduling mechanism to transmit the data flow smoothly and does
not cause network overload, because it considers the temporal characteristics of
the flow to adjust itself to the available network capacity. ARTCP utilizes
inter-segment spacing and round trip time measurements to control the rate of the
flow.

In order to study the characteristics of the ARTCP protocol we have developed
and coded a programmable simulation model, which is a universal tool for
studying processes occurring in communication networks. Built on object-oriented
principles, this model allows simulated network topologies of great complexity to
be built and various environments to be set up for simulation experiments. The
results of our simulation studies of ARTCP on this model show that ARTCP has
substantial advantages over standard TCP algorithms. Statistical analysis of
ARTCP traffic traces reveals the self-similar property of ARTCP traffic, which is
in line with other studies of traffic traces in network systems.



1 Introduction 

Communication protocols coordinate information transmission processes in distrib- 
uted systems, such as communication networks. Communication protocols form sev- 
eral levels, separated by their functionality. An ordered set of protocol layers, or a 
protocol hierarchy, forms network architecture. TCP/IP architecture [1] is one of the 
most well defined and widely used. All nodes in a TCP/IP network are divided into 
end systems (network nodes), which are the sources and sinks of information and 
intermediate systems (routers), which provide the communication path between end 
systems so that information transmission among the latter can occur. 

A two-way information flow in a network between a pair of adjacent systems is 
provided by a channel, which connects these two systems. A channel can be charac- 
terized by the rate of information flow, which can traverse the channel in each direc- 
tion (bandwidth), transmission delay and bit error probability. At each point of a 
channel connection to a router there exists a buffer, which holds a queue of data 
packets awaiting transmission via this particular channel. The buffer space and chan- 
nel bandwidth are shared resources of the network, that is, all information flows with 
a common channel have to share resources and compete for the access to them. In the 
case when the rate of information arrival at the router exceeds the maximum possible
rate of its departure, network congestion results. This congestion is indicated by
buffer overflow and data losses.

The transport layer protocol provides a reliable, ordered, and effective data
transmission facility between two end systems, that is, end-to-end. Two systems
communicating via a transport protocol can be considered a self-controlled
distributed system. The rules that nodes obey govern how senders access the shared
resources of the network; therefore the efficiency of the transmission protocol
defines the efficiency of the network in general.

Transmission Control Protocol (TCP) [2-4] is the major transport layer protocol of
the TCP/IP network architecture model. TCP provides reliable duplex data transport
with congestion control between end systems. The TCP source receives information
from its user as a bit sequence. The TCP object then chops this bit sequence into
finite-length blocks of user data with control information attached (segments).
Segments are encapsulated in network packets and put on the network to be delivered
to the recipient. The recipient picks packets from the network, takes the TCP
segments out of them, absorbs the control data, and passes the reconstructed bit
sequence to the user.

A flow of segments between two TCP end systems can pass through an ordered set 
of routers and channels. The maximum bandwidth of TCP connection is limited by 
the minimum bandwidth of the channels, through which the flow passes. In general, 
the channels are shared between several TCP and non-TCP flows. A congestion con- 
trol algorithm, which is a part of TCP, tries to send segments out at a rate, which does 
not exceed that of the slowest channel in the path of the connection and does not 
overflow the receiver (that is, does not exceed the rate at which the receiver absorbs 
data). 

A set of multiple TCP flows sharing a common channel is a complex self- 
organizing system [5]. TCP algorithms define behavior of every TCP protocol object 
in such a system, whereas behavior of the system as a whole cannot be described in 
general by the sum of actions of all system components. Every transmitting protocol 
object tends to adapt its sending rate to an available network resource with maximum 
efficiency by cooperating with other entities in the system. 

TCP works in the following way: both sender and receiver reserve a certain buffer
space to hold segments awaiting transmission onto the network (at the sender) or
delivery to the user (at the receiver). Each byte of data has a sequence number, which
is unique for the connection. A segment consists of the header and payload fields. The
header carries control information for this segment, in particular the sequence
number of the first byte of its data. The payload field carries user data bytes.

When sending out a segment, TCP sets a timer for it. If the timer fires, the segment
associated with it is considered lost and is retransmitted. By implicitly considering
congestion to be the only source of packet loss in the network, TCP treats each
segment loss event as a signal to decrease its sending rate. The TCP receiver sends
back to the sender an acknowledgement carrying the sequence number of the next expected
byte of user data, that is, one larger than the last byte received in sequence.
The transmission rate of the TCP sender is controlled by a variable-size sliding window
algorithm. The sender is allowed to send all bytes starting from the last acknowledged
one and falling within the window. While there are no loss indications, TCP grows the
sending rate linearly, and it drops the rate multiplicatively when a loss event is detected.
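
The linear growth and multiplicative drop described above can be illustrated with a
minimal sketch. The function names, the increase of one segment per round trip, and
the halving factor are assumptions chosen for illustration; the text above does not
specify them.

```python
def aimd_step(window, loss, increase=1.0, decrease=0.5):
    """Return the window (in segments) to use in the next round trip."""
    if loss:
        # Multiplicative decrease on a loss indication; never below one segment.
        return max(1.0, window * decrease)
    # Linear (additive) growth while no loss is indicated.
    return window + increase


def aimd_trace(rounds, loss_rounds, start=1.0):
    """Apply aimd_step for `rounds` round trips; losses occur in `loss_rounds`."""
    window, trace = start, []
    for r in range(rounds):
        window = aimd_step(window, r in loss_rounds)
        trace.append(window)
    return trace


# A single loss in round 4 (counting from 0) interrupts the linear growth.
trace = aimd_trace(6, loss_rounds={4})
```

The resulting trace shows the sawtooth pattern that keeps driving the network towards
saturation, which is the behavior the inefficiencies listed next refer to.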

TCP is presently known to have a number of substantial inefficiencies: 





1. In order to assess available network bandwidth, the TCP congestion control
algorithm constantly increases load on the network, pushing it to the saturation point,
where packet losses signal network overload to the sender. This artificially
created network congestion causes frequent packet losses and subsequent
retransmission of lost packets. Excessive retransmissions and high buffer occupancy
levels lead to sharp growth of transmission delay and transmission jitter.

2. TCP interprets most packet loss events as signals of network congestion. Thus a
TCP sender decreases its transmission rate when data loss occurs, irrespective of
the cause of the loss. Such behavior makes TCP substantially inefficient as a
data transport protocol for wireless networks, where packets may be lost for
reasons other than congestion.

3. Local instabilities in the TCP sender's algorithms increase the probability of packet
losses, because the average queue lengths in router buffers oscillate near the value
determined by the total buffering space available. Long queues together with
bursts in TCP transmission cause packet losses.



2. Known Improvements of the TCP Congestion Control Algorithm 

A great deal of research has been aimed at improving TCP performance, which is
limited by the shortcomings outlined above.

These works contributed much to the development of ARTCP. Among them we would cite
TCP Vegas [6], TRUMP [7], PP [8], NETBLT [9], Tri-S [10], and DUAL [11].

In TCP Vegas segment retransmission is improved by using more precise timing, and
the congestion avoidance mechanism is based on a carefully monitored sending rate.
DUAL algorithms derive additional hints about network congestion by observing
changes in the RTT. In the Tri-S scheme the sender's window is changed slowly to
see what effect this has on the throughput. The NETBLT protocol sender uses
feedback from the receiver to decide at what rate the next buffer of data is to be
transmitted.

Unfortunately, none of these methods is used in standard network systems.
The main drawback of Vegas, Tri-S, and DUAL is that they are all based on the TCP
algorithm, which decreases the rate when a segment loss indication appears and, when
no loss is detected, drives the network linearly to saturation, resulting in buffer
overflows. The TRUMP protocol requires all routers to use a form of explicit congestion
notification, which is hard to implement everywhere. The authors of the PP method
propose a very efficient way of measuring network resource availability: segments
are sent in back-to-back pairs, and the separation between the segments at the receiver
determines the load on the shared network resources. However, the PP scheme can be
implemented only in a network with separate queuing for each flow, which is hard to
achieve in TCP/IP networks.
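
As an illustration of the packet-pair idea, the sketch below converts the separation
between two back-to-back segments, as observed at the receiver, into a rate estimate.
The formula (segment size divided by the gap) is the usual packet-pair estimate and
is an assumption here, as is the function name; the text above does not give the
exact computation used in [8].

```python
def packet_pair_estimate(segment_bytes, gap_seconds):
    """Estimate the bottleneck service rate in bits per second.

    segment_bytes: size of the second segment of the pair.
    gap_seconds:   separation between the two segments at the receiver.
    """
    if gap_seconds <= 0:
        raise ValueError("the receiver-side gap must be positive")
    return segment_bytes * 8 / gap_seconds


# Two 1000-byte segments arriving 31.25 ms apart.
estimate_bps = packet_pair_estimate(1000, 0.03125)
```

The estimate is only meaningful when the two segments stay adjacent in the bottleneck
queue, which is why the method relies on separate queuing for each flow.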



3. Adaptive Rate TCP (ARTCP) 



Our task was to propose a new algorithm for a transport protocol that remains within
the scope of the TCP/IP architecture but is more efficient than TCP. The new protocol
should also be universal, that is, equally usable in wireless and wired network
environments, without violating the principle of end-to-end connectivity and without
requiring modification of internetwork routers or channels.

In the course of this work we have: 

• Developed a congestion control algorithm for the new transport protocol, ARTCP
(Adaptive Rate Transmission Control Protocol). ARTCP uses timing characteristics
of the data flow as input parameters of the congestion control algorithm and
efficiently combines a window-based flow control algorithm with individual
scheduling of every segment. ARTCP can gradually replace TCP in wired and wireless
networks, preserving intermediate compatibility with the former.

• Provided a formal description of ARTCP as the code of a C++ class, which models
the new protocol.

• Developed a universal object-oriented imitational model, which allows construction
of networks with complex topologies and simulation of the most important
characteristics that influence transport protocols.

We have performed a number of experiments using this model and have shown
that ARTCP has an advantage over TCP in most usage scenarios. We have also discovered
self-similarity in ARTCP traffic traces with a large number of samples.

Compared to those of TCP, the new algorithms of ARTCP have several advantages:

• ARTCP does not push a network to the congested state in order to find its maximum
bandwidth. Because of this, a network with steady-state ARTCP flows does not
experience packet losses at all. Thus, the efficiency of network infrastructure
usage is increased.

• ARTCP does not interpret segment loss as an indication of network congestion. Due to
this property ARTCP can be used much more efficiently than TCP in networks
with high bit error rates.

• ARTCP keeps the average router queue length near the minimum (one packet per
flow on average), because ARTCP not only adapts the rate at which segments are sent
into the network to the slowest service rate in this network, but also possesses an
overload compensation algorithm. Shorter queues lead to shorter transmission delays.

• The ARTCP logic does not preclude the existence of TCP-compatible implementations,
so ARTCP can be introduced first on the end systems where this protocol
would be most useful.



4. Considered Characteristics of a Transport Protocol 

In our studies of ARTCP behavior we observed the following main characteristics of
the protocol:

1. The relative number of lost segments (ratio of the number of lost segments to the
number of sent segments).

2. The channel bandwidth utilization efficiency (ratio of the number of successfully
received bytes to the maximum possible number of transferred bytes):

$$U = \frac{\mathrm{bytes\_received}}{\mathrm{channel\_bandwidth} \times \mathrm{run\_time}} \qquad (1)$$

3. The fairness of resource sharing:

$$F = \frac{\left(\sum_{i=1}^{n} b_i\right)^2}{n \sum_{i=1}^{n} b_i^2} \qquad (2)$$

where $b_i$ is the i-th connection's bandwidth share and $n$ is the number of connections.

4. The average queue length $Q$ of the bottleneck link router (Fig. 1).

We compared ARTCP and TCP by these characteristics in similar usage scenarios.
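
The utilization and fairness characteristics can be computed as in the sketch below.
It assumes that the channel bandwidth is expressed in bytes per second, so that the
utilization ratio is dimensionless, and that the fairness measure is Jain's fairness
index; both are assumptions of this illustration, and the function names are
hypothetical.

```python
def utilization(bytes_received, bandwidth_bytes_per_s, run_time_s):
    """Ratio of delivered bytes to the maximum the channel could carry."""
    return bytes_received / (bandwidth_bytes_per_s * run_time_s)


def fairness(shares):
    """Jain's fairness index: 1 for equal shares, 1/n if one flow takes all."""
    n = len(shares)
    total = sum(shares)
    return total * total / (n * sum(b * b for b in shares))


u = utilization(1_600_000, 32_000, 100)      # half of the channel was used
f_equal = fairness([8.0, 8.0, 8.0, 8.0])     # four flows with equal shares
f_single = fairness([32.0, 0.0, 0.0, 0.0])   # one flow takes everything
```

Under this definition the fairness value lies between 1/n and 1, which is the range in
which the fairness results reported later in the paper fall.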



5. Formal Model of the System 

The network consists of several end systems, two routers, and a number of channels
connecting the end systems to the routers and the routers to each other. An ARTCP
protocol object is executed at each end system. The end systems are grouped into two
LANs, each connected to one router (Fig. 1). The routers provide connectivity between
the two LANs by sending traffic via a channel that models a WAN link with a relatively
small bandwidth and a longer delay than those of the channels within each LAN.



Table 1. Model parameters and variables. 



Parameter      Description

$S$            Segment size (bytes).
$T_s(t)$       Inter-segment interval set by the sender.
$T_r(t)$       Inter-segment interval measured by the receiver.
$R_s(t)$       Rate of segment departure set by the sender.
$R_r(t)$       Rate of segment arrival measured by the receiver.
$R_e(t)$       Rate of segment arrival at the receiver, as the sender
               learns it from acknowledgements.
$A(t)$         Compensation area. Represents the amount of data accumulated
               in network buffers.
$Q(t)$         Router queue length.
$Q_{\max}$     Router buffer size (limits the maximum queue).
BER            Bit error ratio.
Speedup        Rate growth probability coefficient.
Slowdown       Rate decrease probability coefficient.
$\varepsilon$  Precision coefficient. Used for comparisons. $\varepsilon \ll 1$.
RTT            Round trip time.

[Figure: sources and receivers on 10 Mbps LANs, connected by a 512 Kbps WAN link]

Fig. 1. The general topology of the system.



Each of the end systems in one LAN sends segments to a particular system in the
other LAN. ARTCP sends out segments spaced in time by the interval

$$T_s(t) = S / R_s(t), \qquad (3)$$

where the rate $R_s(t)$ is controlled by the congestion control algorithm of ARTCP. We
assume that ARTCP sources always have data to send. Senders and receivers are in
different LANs. ARTCP receivers reply with acknowledgements, which travel in the
opposite direction as small segments without data. However, ARTCP allows for the
same piggybacking of control data as TCP does.
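
The spacing rule of equation (3) can be sketched as follows. The helper names are
hypothetical, and the rate is assumed to be expressed in bytes per second so that the
interval comes out in seconds.

```python
def inter_segment_interval(segment_bytes, rate_bytes_per_s):
    """Interval between segment departures: segment size divided by sending rate."""
    if rate_bytes_per_s <= 0:
        raise ValueError("the sending rate must be positive")
    return segment_bytes / rate_bytes_per_s


def departure_times(count, segment_bytes, rate_bytes_per_s, start=0.0):
    """Departure times of `count` segments paced at a constant rate."""
    gap = inter_segment_interval(segment_bytes, rate_bytes_per_s)
    return [start + k * gap for k in range(count)]


# 1000-byte segments at 8000 bytes per second leave every 0.125 s.
times = departure_times(4, 1000, 8000)
```

In a running protocol the rate changes over time, so the interval would be recomputed
before each segment is scheduled rather than fixed for a whole batch as it is here.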

The task of the router is to forward each segment towards its receiver according to
the receiver's address in the segment's header. A FIFO queue of segments is organized
in router R1, where segments wait to be sent to R2 over the WAN channel. The queue
has finite length

$$Q(t) \le Q_{\max}. \qquad (4)$$

A segment arriving at the output interface of R1 at time $t$ is placed in the queue if

$$S \le Q_{\max} - Q(t), \qquad (5)$$

otherwise the segment is lost. The queue at the output interface of the R1 router is
served at the rate determined by the bandwidth of the channel between R1 and R2.
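
The admission rule of the router queue can be sketched as a small drop-tail FIFO. The
class and method names are hypothetical, and the comparison is taken to be non-strict
(a segment that exactly fills the buffer is accepted), which is an assumption of this
illustration.

```python
from collections import deque


class OutputQueue:
    """Finite FIFO queue at a router output interface; sizes are in bytes."""

    def __init__(self, q_max_bytes):
        self.q_max = q_max_bytes
        self.queued = 0          # current queue length in bytes
        self.segments = deque()
        self.dropped = 0

    def enqueue(self, segment_bytes):
        """Accept the segment if it fits into the free buffer space, else drop it."""
        if segment_bytes <= self.q_max - self.queued:
            self.segments.append(segment_bytes)
            self.queued += segment_bytes
            return True
        self.dropped += 1
        return False

    def dequeue(self):
        """Remove the oldest segment (called when the channel can serve it)."""
        if not self.segments:
            return None
        segment_bytes = self.segments.popleft()
        self.queued -= segment_bytes
        return segment_bytes


queue = OutputQueue(2500)
results = [queue.enqueue(1000) for _ in range(3)]   # the third segment does not fit
```

In the full model, dequeue would be driven by the bandwidth of the outgoing channel.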

We consider the following properties of the channels: bandwidth, transmission delay,
and bit error ratio. Channel bandwidth determines the rate at which the bits of segments
are accepted into the channel. Transmission delay is the length of the
interval between the acceptance of a particular bit into the channel and the appearance
of this bit at the other end of the channel. The bit error ratio $ber$ determines the
probability of segment loss as

$$1 - (1 - ber)^{8S}. \qquad (6)$$
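
The dependence of segment loss on bit errors can be checked numerically with the
sketch below. It assumes independent bit errors and takes the exponent to be the
segment length in bits (eight times the segment size in bytes); the function name is
hypothetical.

```python
def segment_loss_probability(ber, segment_bytes):
    """Probability that at least one bit of the segment is corrupted."""
    bits = 8 * segment_bytes
    return 1.0 - (1.0 - ber) ** bits


p_clean = segment_loss_probability(0.0, 1000)    # error-free channel
p_noisy = segment_loss_probability(1e-5, 1000)   # roughly 0.077 for 8000 bits
```

Even a bit error ratio that looks small removes several percent of full-size segments,
which is why treating every loss as congestion is costly on noisy channels.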

Each ARTCP object performs as follows:

Network congestion for ARTCP is indicated not by lost segments, but by the temporal
properties of its flow. An ARTCP source concludes that congestion starts to build
up when, with growing $R_s(t)$, the RTT starts to grow and the flow arrival rate
measured at the receiver stabilizes:

$$R_e(t) < R_s(t - RTT). \qquad (7)$$

ARTCP segments are put onto the network not as a back-to-back burst within the window,
as TCP does, but spaced by time intervals $T_s(t)$ by the scheduler. Measurement
of the inter-segment intervals $T_r(t)$ at the receiver gives the flow arrival rate
$R_r(t)$. The sender stops increasing the sending rate $R_s(t)$ when the RTT starts to
increase and the segment arrival rate drops below the sending rate of the
segments. These two events indicate that the system has reached the state in which
the average rate of segment arrival equals the average flow service rate, so that a
further increase of the sending rate will lead to growth of the queue length in router
buffers. The ARTCP receiver returns the observed values of the flow arrival rate
$R_r(t)$ along with acknowledgements. Having obtained the acknowledgement of a segment
RTT seconds after the segment was sent, the ARTCP sender extracts the rate at which
the flow containing this segment was delivered to its receiver by the network. The
sender uses this information, $R_e(t)$, as an estimate of the available network
bandwidth (Fig. 2).
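
The two measurements involved can be sketched as follows: the receiver converts an
observed inter-segment interval into an arrival rate, and the sender tests whether
congestion is building up. The tolerance `eps` applied to both comparisons and the
function names are assumptions of this illustration, not part of the protocol
description.

```python
def arrival_rate(segment_bytes, interval_s):
    """Arrival rate seen by the receiver from one inter-segment interval."""
    return segment_bytes / interval_s


def congestion_building(reported_rate, sent_rate_rtt_ago, rtt, min_rtt, eps=0.05):
    """True when the RTT has grown and the reported arrival rate lags the sending rate."""
    rtt_growing = rtt > (1.0 + eps) * min_rtt
    rate_lagging = reported_rate < (1.0 - eps) * sent_rate_rtt_ago
    return rtt_growing and rate_lagging


rate = arrival_rate(1000, 0.125)                                    # 8000 bytes/s
congested = congestion_building(12000.0, 16000.0, rtt=0.25, min_rtt=0.2)
steady = congestion_building(16000.0, 16000.0, rtt=0.2, min_rtt=0.2)
```

The sending rate used in the comparison is the one in effect one RTT earlier, because
the acknowledgement that carries the report describes segments sent at that time.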

Congestion control and error correction algorithms of ARTCP are completely
independent, because segment loss is not taken as a sign of network overload.
Retransmission does not occur immediately; instead, the segment in error receives higher
priority and stays in the transmission queue to be sent first when the scheduler allows.

ARTCP does not unnecessarily stress the network by congesting it. Unlike TCP,
ARTCP allows for fast and efficient adaptation to available network resources.

Owing to its scheduler, ARTCP sends segments to the network more smoothly, avoiding
bursts and the consequent queue overflow. Therefore the need for buffer space in
network routers is decreased.

The main difference between ARTCP and TCP is the congestion control algorithm of
ARTCP, which sets the rate of the data flow to match the available resource by
observing the values of $R_e(t)$ and RTT.



5.1. Congestion Control Mechanism of ARTCP 

ARTCP uses both the standard sliding window of variable size, to prevent overflow of
the receiver, and an innovative rate adaptation function, which makes the sending rate
match the available network bandwidth. All segments within the window announced by the
receiver are sent at the rate determined by the rate adaptation function, whose goal is
to send segments out exactly at the rate at which they are served by the network and
to compensate for possible overload.

The rate adaptation algorithm has several states of operation (Fig. 3). After the start,
the algorithm tries to determine the available rate as fast as possible and then enters
fine-tuning mode, where the rate of the flow is kept at the value of the available rate.
The rate control algorithm receives the values of $R_e(t)$ and RTT at its input and,
depending on these values and its current state, makes a state transition and
calculates the value of $R_s(t)$, which is used by the scheduler to set the length of
the inter-segment delay interval.

Fig. 2. ARTCP queues and in-object information flows.



5.2. States of the Rate Adaptation Algorithm 

At time $t$ the sender estimates the available network bandwidth as $R_e(t)$ and can
compare the flow sending rate as it was at time $t - RTT$ with the flow arrival rate
at the receiver, because at time $t$ the sender obtains acknowledgements for segments
sent before time $t - RTT$.

The fast start state (FS) has the goal of growing the sending rate of the flow from its
minimal value to the value permitted by the available bandwidth as fast as possible
immediately after connection initialization. In FS state the flow rate of the sender
grows exponentially. The algorithm exits FS state when

$$R_e(t_i) < (1 - \varepsilon) \cdot R_s(t_i - RTT). \qquad (8)$$

The multiplicative decrease state (MD1) follows the FS state. After the fast start ends,
the value of $R_s(t)$ will be larger than $R_e(t)$, because it was growing exponentially.
In MD1 state the flow rate is instantaneously set below $R_e(t)$. After this
decrease the algorithm proceeds to the compensation state.
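
The fast start phase and its exit test can be sketched in discrete round-trip steps.
The doubling factor, the idealized feedback (the reported arrival rate is the smaller
of the sending rate and the bottleneck rate), and all names are assumptions made for
illustration only.

```python
def fast_start(bottleneck_rate, initial_rate, eps=0.05, growth=2.0, max_rtts=64):
    """Grow the rate exponentially until the reported arrival rate lags behind it.

    Returns the list of sending rates used and the arrival rate reported at exit.
    """
    sent = [initial_rate]
    for _ in range(max_rtts):
        rate_rtt_ago = sent[-1]
        # Idealized report: the network cannot deliver faster than the bottleneck.
        reported = min(rate_rtt_ago, bottleneck_rate)
        if reported < (1.0 - eps) * rate_rtt_ago:
            return sent, reported           # exit condition of the FS state
        sent.append(rate_rtt_ago * growth)
    raise RuntimeError("bottleneck not reached within max_rtts")


rates, estimate = fast_start(bottleneck_rate=12000, initial_rate=1000)
```

At exit the last sending rate exceeds the reported rate, so the rate has to be cut
before the overload accumulated in the buffers is compensated.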




Fig. 3. State diagram of the ARTCP flow control algorithm. 

The compensation state (REC) grows the sending rate linearly up to the already known
value of the available bandwidth $R_e(t)$, compensating for the overload that occurred
in FS state. In the compensation state the algorithm calculates the compensation area
value $A(t_i)$ as the area of the figure ABC (Fig. 4), formed by the values of $R_s(t)$
above $R_e(t_i)$ during the time spent in FS state. The value of $A(t_i)$ represents
the amount of excess data accumulated in router buffers while the sending rate of the
flow exceeded the available network bandwidth. The sending rate in REC state grows
linearly in such a way that the amount of data sent into the network is exactly
$A(t_i)$ less than the amount that would be sent if the sending rate were equal to
$R_e(t_i)$. Geometrically, this condition is the equality of the area of triangle CDF
and $A(t_i)$. State REC is terminated when

$$R_s(t) \ge R_e(t). \qquad (10)$$
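
The triangle-area condition determines how long the linear ramp must last. If the rate
climbs linearly from a starting rate to the target rate over a time T, the data
"saved" relative to sending at the target rate is the triangle area (target - start)
* T / 2. Setting this equal to the compensation area gives the sketch below; the
function and parameter names are hypothetical.

```python
def compensation_ramp(area_bytes, target_rate, start_rate):
    """Duration and slope of a linear ramp that sends `area_bytes` less than
    a constant flow at `target_rate` would send over the same time."""
    deficit = target_rate - start_rate
    if deficit <= 0 or area_bytes <= 0:
        raise ValueError("need start_rate < target_rate and a positive area")
    duration = 2.0 * area_bytes / deficit    # from area = deficit * duration / 2
    slope = deficit / duration               # rate increase per second
    return duration, slope


# 4000 bytes of excess data, rate cut to 8000 bytes/s, target of 12000 bytes/s.
duration, slope = compensation_ramp(4000.0, 12000.0, 8000.0)
```

A deeper initial cut makes the ramp shorter but steeper; how far below the estimate
the rate is set in the preceding state is not specified in the text.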



The fine tuning (FT) state follows the REC state. In FT state the sending rate of the
flow slowly adapts to the available bandwidth of the network. The relation between the
speedup and slowdown coefficients determines the probability of a rate increase or
decrease at each clock tick. The speedup coefficient, which represents the possibility
of a rate increase, is inversely proportional to the current rate of the flow. The
slowdown coefficient, which represents the possibility of a rate decrease, is
proportional to the ratio of the measured RTT to the minimal observed value of RTT.
Speedup is thus smaller for faster flows and larger for smaller values of $R_s(t)$,
which helps slow flows to achieve a larger relative share of bandwidth. The value of
slowdown is equal for all flows and grows with RTT. The FT state thus gives slower
flows the possibility to increase their rates and causes all flows to decrease their
rates equally when RTT grows, that is, when queues start to build up. The algorithm
leaves FT state when a sharp variation of the measured RTT occurs.
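
One way to turn the two coefficients into a probabilistic rate adjustment is sketched
below. The constants `k_up` and `k_down`, the normalization speedup / (speedup +
slowdown), and the 1% step are all assumptions of this illustration; the text only
states what each coefficient is proportional to.

```python
import random


def increase_probability(rate, rtt, min_rtt, k_up=1000.0, k_down=0.1):
    """Probability of a rate increase at one clock tick (illustrative normalization)."""
    speedup = k_up / rate                 # inversely proportional to the current rate
    slowdown = k_down * rtt / min_rtt     # proportional to RTT / minimal observed RTT
    return speedup / (speedup + slowdown)


def fine_tune_step(rate, rtt, min_rtt, rng, step=0.01):
    """Nudge the rate up or down by a small relative step."""
    if rng.random() < increase_probability(rate, rtt, min_rtt):
        return rate * (1.0 + step)
    return rate * (1.0 - step)


rng = random.Random(0)
p_slow = increase_probability(2000.0, rtt=0.2, min_rtt=0.2)
p_fast = increase_probability(20000.0, rtt=0.2, min_rtt=0.2)
p_queued = increase_probability(2000.0, rtt=0.4, min_rtt=0.2)
new_rate = fine_tune_step(2000.0, 0.2, 0.2, rng)
```

With these choices a slow flow is more likely to speed up than a fast one, and every
flow becomes less likely to speed up as the RTT rises above its minimum.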

Multiplicative decrease state 2 (MD2) is needed for a fast rate decrease, which is
triggered by any sharp growth of the measured RTT. Following the decrease, which
may be caused, for example, by a failed network link and rerouting of traffic over a
slower link, the algorithm reenters FT state. No compensation is needed, because the
rate was not growing fast in the previous state.
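
The sequence of states described in this section can be summarized as a small
transition function. The event flags are hypothetical names for the conditions given
in the text, and the sketch follows only the transitions stated there; the complete
state diagram of Fig. 3 may contain further arcs.

```python
from enum import Enum


class State(Enum):
    FS = "fast start"
    MD1 = "multiplicative decrease 1"
    REC = "compensation"
    FT = "fine tuning"
    MD2 = "multiplicative decrease 2"


def next_state(state, arrival_lags=False, compensated=False, rtt_jump=False):
    """Return the state that follows `state` given the observed events."""
    if state is State.FS:
        return State.MD1 if arrival_lags else State.FS
    if state is State.MD1:
        return State.REC                  # the decrease is instantaneous
    if state is State.REC:
        return State.FT if compensated else State.REC
    if state is State.FT:
        return State.MD2 if rtt_jump else State.FT
    return State.FT                       # MD2 always returns to fine tuning


path = [State.FS]
for events in ({"arrival_lags": True}, {}, {"compensated": True},
               {"rtt_jump": True}, {}):
    path.append(next_state(path[-1], **events))
```

The path built above visits FS, MD1, REC, FT, MD2, and FT again, which is the order in
which the states are introduced in the text.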

The rate adaptation process of a single ARTCP flow to a network bandwidth of
96 Kbps is shown in Fig. 4.



6. Simulation Model 

In order to model ARTCP behavior and compare it to TCP, we have developed and
implemented an imitation programmable model (IPM) of ARTCP itself and of all network
components that determine protocol functionality.

The IPM consists of a set of ARTCP protocol objects and all network elements that
influence the behavior of the congestion control algorithm. The IPM is built as a
network of the required number of interacting objects, arranged in a particular
topological scheme. The ARTCP protocol object and the objects representing all other
elements of the network are implemented as C++ classes. The IPM is universal because
the set of main objects it contains (node, router, link) can be used to build a model
of any network, while the individual settings of each object allow any imitation
scenario to be set up.



6.1. Object Structure of the Imitation Programmable Model 

The network being modeled is constructed of any number of end systems, where
ARTCP, TCP, or constant bit rate protocol objects are executed, together with channels
and routers. Segments generated by the active sources travel through the modeled
network, passing through its elements towards the receiver. Each node in the model has
a unique network address, and segments carry source and destination address fields in
their headers. All objects of the model create logs of their states and events, which
are then analyzed to study the dynamic behavior of the model.

Fig. 4. Adaptation process of the single ARTCP flow to 96 Kbps network bandwidth.

6.1.1. The ARTCP Protocol Object 

The object performs initial synchronization of the connection (connection
establishment), rate adaptation, generation and scheduling of segments, reception of
data from the network and generation of acknowledgements, timer-based retransmission,
and fast retransmission [12]. Two ARTCP objects are capable of simultaneous data
exchange in both directions. The internal structure of the ARTCP class is relatively
complex (Fig. 2).

6.1.2. Constant Bit Rate (CBR) Protocol Object 

This protocol object attempts to send segments into network at configured constant 
rate without flow control or lost data retransmission. This protocol is used to model 
multimedia and UDP data flows and study their coexistence with ARTCP. 

6.1.3. End System Object 

Objects modeling end systems are used as platforms to run ARTCP, TCP, or CBR
protocol objects. Each end system has a unique address within the model. This address
has to be set in the segment header for the segment to be delivered to a particular
node. The functions of the end system are to pass interrupts to the active protocol
object on this system and to pass segments between the channels and the active protocol
objects.

6.1.4. Router Object 

We chose to model a router as an internetwork device with a non-blocking switching
matrix and output buffering. This is a good model of a contemporary Internet router.
The router object is composed of several other objects: interfaces (one for each link
the router is connected to) and one switching matrix object, which interconnects the
interfaces.

The router has to switch segments to the appropriate output interface according to
the destination address in the segment header. Each interface maintains a FIFO queue
on the output side, where segments awaiting transmission out of this interface are held.

6.1.5. Channel Object 

The channel object is used to connect end systems and routers in our model. Each 
direction of a channel is independently characterized by a certain bit rate, transmis- 
sion delay and bit error probability. 



7. Modeling ARTCP 

We have performed a number of simulation studies with the ARTCP protocol.

The goal of our modeling experiments was to determine the important properties
of ARTCP (see Section 4). We also needed to compare ARTCP and TCP performance.
Multiple experiments were run in each of several scenarios, where two LANs
were connected via a limited-bandwidth WAN channel. Each channel within a LAN is
characterized by a bandwidth of 10 Mbps, a 0.01-second delay, and a zero bit error
ratio. Through varying scenarios we simulate several simultaneous ARTCP flows and
a CBR flow competing for shared network resources.



7.1. Isolated Flow 

In order to observe the details of the ARTCP rate adaptation process we used a scenario
with an isolated ARTCP flow through a WAN channel with 96 Kbps bandwidth and a
0.1-second delay. The maximum queue length in the R1 router does not exceed 16 Kbytes.
Fig. 4 depicts the plot of the flow sending rate versus time. No segment loss occurs in
this scenario.



7.2. Two ARTCP and One CBR Flow 

For two ARTCP flows coexisting with a CBR flow we needed to check the correctness
of the ARTCP algorithm. In the experiments of this scenario we randomly chose the
start and stop times of the flows and the CBR rates. We used a topology of three
source-destination pairs and two routers connected by a 256 Kbps channel with a
0.1-second delay. The output buffer of router R1 is limited to 32 Kbytes. In each of
the 100 simulation runs of this scenario the first ARTCP flow starts at time t = 0; the
start times of the second ARTCP flow and of the CBR flow are chosen at random from the
intervals 10-110 and 190-210 seconds, respectively. The stop time of the first ARTCP
flow is also taken at random, from the interval 390-410 seconds. The CBR rate is
randomly selected between 50 and 200 Kbps. We obtained the following results in this
scenario: for two ARTCP flows sharing the channel with a CBR flow, the link utilization
is U = 0.981 ± 0.012; for two ARTCP flows without a CBR flow, U = 0.971 ± 0.023; the
number of lost segments in all experiments equals zero; for two ARTCP flows with a CBR
flow, the fairness of resource sharing is F = 0.989 ± 0.011; without CBR,
F = 0.97 ± 0.028.

Fig. 5. Segment sequence graph for two ARTCP and one CBR flow.



7.3. Comparison of ARTCP and TCP 

In order to study the influence of segment loss not caused by network overload on
protocol performance, we set up a simulation where 10 transport protocol traffic flows
were sent through a 256 Kbps channel with a 0.1-second delay and different values of
BER. We ran this simulation for both TCP and ARTCP. For each of the values of
BER (up to 6x10"^) there were 50 runs, each lasted 500 seconds. All flows started 
simultaneously at t = 0. 

As our simulation suggests, ARTCP has a clear advantage over TCP: as BER grows, the
ARTCP flow rate remains nearly constant, whereas the TCP rate goes down sharply.
Fig. 6 shows the plot of average flow rates (averaged over 50 runs) with mean square
deviation vs. BER values.

In the next scenario we compared the bandwidth utilization coefficients and the
fairness of resource sharing for ARTCP and TCP. For each of the protocols 100
simulations were run, each lasting 500 seconds, on 10 variants of network topology
containing from 2 to 20 end systems and from 1 to 10 simultaneous flows.

With a small number of active flows, link utilization is slightly better for TCP. As
the number of active flows goes up, link utilization by ARTCP approaches 1, while TCP
link utilization starts to deteriorate due to retransmissions (Fig. 7). ARTCP flows are
fairer in sharing resources among themselves, and the fairness coefficient F grows as
the number of flows increases.



7.4. Self-Similarity of ARTCP Traffic 

Experimental studies of TCP/IP traffic [13] have shown that the assumption of limited
variance of segment interarrival times is invalid, which calls into question the
applicability of queuing theory to such systems. In their classical works [13-15]
W. Willinger and M. Taqqu have shown that TCP/IP network traffic is characterized by
the property of self-similarity.







Fig. 6. Throughput of ARTCP and TCP vs. bit error ratio. 

At present, the theoretical apparatus for the analysis of self-similar processes is at
an early stage of development and lacks well-studied theoretical models that could be
applied to systems with network traffic. That is why we believe that the simulation
experiment is the main tool for studying such systems.

In order to find out whether a traffic trace is characterized by the property of
self-similarity, the Hurst coefficient is computed over a large volume of samples. To
compute the Hurst coefficient for ARTCP traffic, we ran a simulation to obtain
a large number of samples. In our case it was a series of 147036 measurements,
each of which is the number of segment arrival events at the R1 router from 10 active
ARTCP flows over a period of 0.1 second.

The series was then subjected to statistical analysis using the rescaled adjusted range
(R/S) and aggregated variance methods. The results of both methods were
used to calculate the Hurst coefficient: the R/S method yields 0.63, and the aggregated
variance method yields 0.65. Thus we have shown that ARTCP traffic, just like other
network traffic [14, 15], is self-similar. The existence of self-similarity in the
traffic traces obtained with our imitation programmable model is a good validation of
the latter.
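
The aggregated variance method can be sketched as follows: the series is split into
blocks of size m, the variance of the block means is computed for several m, and the
slope of log(variance) against log(m) is taken to be 2H - 2. The function name, the
block sizes, and the synthetic test series are assumptions of this illustration; it
is not the analysis code used for the results above.

```python
import math
import random


def aggregated_variance_hurst(series, block_sizes):
    """Estimate the Hurst coefficient from the variance of block means."""
    xs, ys = [], []
    for m in block_sizes:
        blocks = len(series) // m
        means = [sum(series[i * m:(i + 1) * m]) / m for i in range(blocks)]
        mu = sum(means) / blocks
        variance = sum((v - mu) ** 2 for v in means) / (blocks - 1)
        xs.append(math.log(m))
        ys.append(math.log(variance))
    # Least-squares slope of log(variance) versus log(block size).
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return 1.0 + slope / 2.0


# Independent samples have no long-range dependence, so H should be near 0.5.
rng = random.Random(1)
noise = [rng.gauss(0.0, 1.0) for _ in range(20000)]
h_noise = aggregated_variance_hurst(noise, [1, 2, 4, 8, 16, 32, 64])
```

Values of H noticeably above 0.5, such as those reported above, indicate that the
variance of the aggregated series decays more slowly than for independent samples.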



8. Conclusions 

We have described an algorithm for a transport protocol that uses scheduling to
transmit data smoothly without causing network overload, and that uses inter-segment
and RTT time measurements to control the rate of the flow. We have created a model
implementation of this protocol as a C++ class.







Fig. 7. ARTCP and TCP link utilization coefficient vs. number of connections. 

The results of simulation studies performed with ARTCP on our imitational programmable
model demonstrate the substantial advantages of our new protocol, ARTCP, over
standard TCP.

Abstract. ParJava is an extension of the Java environment with facilities that
support the development of efficient, scalable, portable, object-oriented parallel
SPMD programs for homogeneous and heterogeneous computing systems with
distributed memory. The ParJava model is determined by four Java interfaces,
which support the notion of a computing space as a JavaVM network. ParJava
allows parallel programs designed for a homogeneous system to be executed on
heterogeneous ones without loss of scalability. These facilities provide a basis
for the design of high-level object models of parallel programming.



1 Introduction 

SPMD-program development and implementation capabilities for multiprocessor
distributed computer systems are discussed. The goal is to design useful and
effective tools supporting the development of portable, scalable parallel programs
running on such systems. Parallel computations in distributed systems are usually
executed using processes together with message passing.

Standard high-performance communication equipment (Fast Ethernet, Myrinet,
etc.) that has recently entered computer markets has brought to life computer clusters
with distributed memory. Their advantage is that they are built from standard,
widespread hardware, and therefore it is possible to use standard software (e.g.,
Linux, various implementations of MPI, and other appropriate software, both
commercial and free).

Computers with differing performance and/or architecture may be united in clusters
as well. Such systems are referred to as heterogeneous clusters [1]. We will call such
systems heterogeneous computer networks (HCN). In particular, this situation arises
when cluster hardware is partially modified or when a local computer network
(workstations, personal computers, etc. connected in a network) is used as a cluster.

The design of distributed memory parallel programs needs special language facilities.
Usually SPMD programs are designed using sequential programming languages
(Fortran, C, C++) together with the standard message passing interface MPI. When
SPMD programs to be executed on an HCN are designed, specific problems arise due to
the heterogeneity of the computer network. To solve these problems some new
programming language facilities are needed. The MPI interface does not provide such
facilities, having been developed to support message passing through homogeneous
networks.

The collection of such facilities is implemented in the ParJava environment [2]. Being
an extension of the standard Java environment, ParJava supports the design,
implementation, execution, and modification of portable, scalable SPMD programs running
on homogeneous and heterogeneous networks of JavaVMs. This means that ParJava allows
parallel Java programs to be executed on supercomputers, homogeneous and
heterogeneous clusters, local computer networks, and virtual computer
networks utilizing free Internet resources. ParJava supports the following two
approaches to the implementation, porting, and execution of Java SPMD programs:

• running one JavaVM on each computer of a distributed computer system
(homogeneous or heterogeneous), which results in a homogeneous or heterogeneous
network of Java executors,

• simulation of a homogeneous network of Java executors on a heterogeneous computer
system by running several JavaVMs on the more productive processors to achieve load
balancing.

The latter approach supports the design of parallel programs for a homogeneous parallel
computer system (say, a supercomputer or a homogeneous cluster) using a local network
of personal computers and/or workstations.
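
The second approach can be illustrated by a simple allocation rule: each computer runs
a number of JavaVMs proportional to its performance relative to the slowest computer.
This rule, the Python function below, and the host names are assumptions made for
illustration; they are not part of the ParJava interfaces, which are Java interfaces.

```python
def vms_per_node(performance):
    """Map each computer to the number of JavaVMs it should run.

    performance: dict of computer name -> relative performance (any positive unit).
    """
    slowest = min(performance.values())
    return {name: max(1, round(p / slowest)) for name, p in performance.items()}


# Three workstations with relative performance 100, 200, and 310.
allocation = vms_per_node({"ws1": 100, "ws2": 200, "ws3": 310})
total_vms = sum(allocation.values())
```

With such an allocation every JavaVM receives roughly the same share of processing
power, so a program written for a homogeneous system can be run unchanged.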

Main disadvantage of Java environment is comparatively low productivity of 
virtual processor. However, JIT-compilers integrated with JavaVM allow to obtain 
highly optimized object code, which is executed on the same speed as Java-programs 
compiled in native code by corresponding Java native compiler. The speed of 
optimized native programs obtained from Java source code is close to that of C/C-H- 
programs (loss in speed is about 1.5 times). 

It is necessary to point that even Java Native Compilers produce object code, 
which works slower than that received from C/C-H-. The point is that Java-program 
(unlike C/C-H-) remains object-oriented during execution time. It makes Java-program 
more flexible and robust but everything has its cost. 

ParJava environment consists of 

• language facilities: the set of Java interfaces (implemented in several packages), 
which extend Java environment by means supporting SPMD programming, 

• analyzers and other tools that help programmer to develop and implement high 
quality scalable SPMD Java-programs. 

According to requirement avoiding any changes in Java language as well as in 
JavaVM no new language constructs were added to Java: all extensions supporting 
parallel programming are introduced by interface and class libraries. 

This paper presents the basic level of ParJava. No additional assumptions 
about parallel programs are made. The Java packages (methods) implementing the basic-level 
environment support low-level data-parallel programming for distributed computer 
systems: all decisions about parallel computations are made by the application 
programmer and implemented with the help of ParJava facilities. Implementations of 
higher-level models are based on the basic level and usually rely on additional assumptions 
about the computer network's topology and its hardware and software properties (e.g., with 
the DVM [3] or HPF model it must be assumed that the parallel computer network has a one-, 
two-, or three-dimensional grid topology; other high-level models will be based on 
other assumptions). 


2 ParJava Interfaces 

The basic-level ParJava model is defined by four Java interfaces, in which the notion 
of a computer space is introduced as a JavaVM network executing an SPMD program. 
A JavaVM network is represented by a weighted graph: the weight of each node is equal to the 
productivity of the corresponding JavaVM, and the weight of each arc is equal to the capacity of the 
corresponding channel. 
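The weighted-graph representation can be sketched as follows. This is only an illustration: the class name `JavaVMNetwork` and its methods are hypothetical and are not the ParJava interfaces, and channels are assumed to be symmetric.

```python
from dataclasses import dataclass, field


@dataclass
class JavaVMNetwork:
    """Hypothetical weighted graph of JavaVMs (not the ParJava API)."""
    node_weight: dict = field(default_factory=dict)  # node id -> productivity
    arc_weight: dict = field(default_factory=dict)   # {a, b} -> channel capacity

    def add_node(self, node, productivity):
        self.node_weight[node] = productivity

    def add_channel(self, a, b, capacity):
        # Assumption: a channel has the same capacity in both directions.
        if a not in self.node_weight or b not in self.node_weight:
            raise KeyError("both endpoints must be added first")
        self.arc_weight[frozenset((a, b))] = capacity

    def capacity(self, a, b):
        # Nodes without a direct channel are reported with capacity 0.
        return self.arc_weight.get(frozenset((a, b)), 0.0)

    def total_productivity(self):
        return sum(self.node_weight.values())
```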

The IJavaNet interface specifies methods for creating JavaVM networks and their 
sub-networks, as well as some network transformations. We call a JavaVM network any 
enumerated set of JavaVMs that are able to exchange messages. No assumptions 
about the homogeneity (or heterogeneity) of the network are made at this level. Each 
JavaVM sub-network is also treated as a network; it has a parent network (sub-network) 
and may have an arbitrary number of child sub-networks. A JavaVM network that has no 
parent network is called the root JavaVM network. The IJavaNet interface defines 
methods that create a network (sub-network) of Java-executors using nodes of the 
current network, return the number of nodes (or free nodes) of the current network, 
return the current number of each node of the current network, pass to the parent network, 
and create a new network as the union or intersection of the current network with some additional 
network. The JavaNet class (implementing the IJavaNet interface) may be used to 
create a computer network (which is assumed to be homogeneous) by starting exactly 
one Java-executor on each node of the parallel computer system. 

The IFullGraph interface specifies operations on a full weighted graph. 

The INetProperties interface allows specifying some properties of a heterogeneous 
Java-net, which are used when an optimal homogeneous or heterogeneous Java-net is 
modeled on a given heterogeneous computer network. This interface extends the 
interfaces IJavaNet and IFullGraph. It also defines methods for 
specifying the special requirements of an SPMD program on the topology of the computer network 
(e.g., star, grid, line, etc.). 

When an SPMD program is executed on a heterogeneous JavaVM net, optimal use of 
system resources may be achieved using non-uniform data distribution. The interface 
IHeterogenNet provides methods that calculate the relative productivity of any 
specified node, create a heterogeneous network executing exactly one JavaVM on each 
node, determine and delete all nodes that do not satisfy the scalability conditions, 
optimize the network according to program requirements, scatter parts of the transmission 
buffer of the current node, which may differ in size, to the reception buffers of the remaining nodes, 
and gather data into the reception buffer of a given node from the transmission buffers of the remaining 
nodes. 
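A non-uniform scatter can be sketched as below, assuming that each node should receive a share of the buffer proportional to its relative productivity. The function names and the largest-remainder rounding rule are illustrative assumptions and do not reproduce the IHeterogenNet methods.

```python
def partition_sizes(total_items, productivity):
    """Split `total_items` among nodes in proportion to `productivity`.

    Largest-remainder rounding (an assumed rule) guarantees that the
    sizes are integers and add up to `total_items`.
    """
    total_p = sum(productivity)
    if total_p <= 0:
        raise ValueError("total productivity must be positive")
    exact = [total_items * p / total_p for p in productivity]
    sizes = [int(x) for x in exact]
    leftover = total_items - sum(sizes)
    # Hand the remaining items to the nodes with the largest fractions.
    order = sorted(range(len(exact)),
                   key=lambda i: exact[i] - sizes[i], reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes


def scatter(buffer, productivity):
    """Cut `buffer` into consecutive parts of productivity-weighted size."""
    parts, start = [], 0
    for size in partition_sizes(len(buffer), productivity):
        parts.append(buffer[start:start + size])
        start += size
    return parts
```

A node that is three times as productive as another thus receives about three times as much data, so all nodes finish their parts at about the same time.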

When the optimal network of Java-executors is created, the productivity of each node, 
measured using appropriate benchmarks (http://www.nas.nasa.gov/Software/NPB/, 
http://netlib2.cs.utk.edu/benchmark/linpackjava/), should be taken into account, as well as the 
program's requirements on the topology of the target network. When the execution times of the 
sequential and parallel parts of an SPMD program are comparable, it is also necessary to 
estimate the time needed to execute the sequential parts. 

The interface IHomogenNet defines methods supporting the creation of a homogeneous 
network of Java-executors on a given heterogeneous computer network. 




178 A. Avetisyan, S. Gaissaryan, and O. Samovarov 



3 ParJava Environment 

The ParJava environment makes it possible to edit, debug, and execute parallel programs on 
homogeneous or heterogeneous computer networks (supercomputers, clusters, or local 
networks of workstations and/or personal computers). The environment also makes it possible 
to model homogeneous computer networks on heterogeneous ones. 

The user interface to the ParJava environment is started on one of the nodes of a parallel 
computer network (referred to below as the “root”). When the list of available nodes is 
displayed, the “root” node is marked by the word “root”. 

The “Tools” item of the main menu provides access to the ParJava facilities 
that support the choice of the network and the compilation of a Java program. The “Tools” submenu 
provides the following items. “New Net” allocates a new subnet using a dialog box 
that contains the list of available nodes; a parallel program will be executed on the 
network consisting of the nodes marked in the list. “Hosts Performance” determines the relative 
performance of the nodes of the allocated network. “Compile” compiles an SPMD program 
using the Java compiler from the JDK 1.2 environment. “Run” executes a parallel program 
on the “root” node in sequential mode. “Run on Homogeneous Net” executes an 
SPMD program on a homogeneous network of JavaVMs, which is modeled on the 
current network. The number of nodes is defined by the user (in automatic mode, the 
optimal number is determined). “Run on Heterogeneous Net” executes an SPMD program 
on a heterogeneous network of JavaVMs (one JavaVM is started on each node). 
Nodes that do not satisfy the scalability condition are eliminated. 

When a sequential program is being tested and debugged, it is necessary to ensure 
not only its correctness but also its efficiency, stability, and scalability. For this 
purpose it is useful to know some properties of the program (profiles, traces, slices, 
etc.). Effective distribution of an SPMD program over the nodes of a heterogeneous 
network requires knowledge of the program parameters that determine the actual 
speed of program execution on each node of the network. 

The tools for determining these parameters are collected in the “Analyzers” 
submenu of the main menu, which provides the following items. “Instrumentate” 
inserts system debugging calls into an SPMD program. “Profile” determines the dynamic 
profile of a parallel program. These two tools make it possible to take the weight of 
the sequential part into account when the performance of the nodes is calculated. “Test_mode” 
translates a parallel program in a mode that uses a debugging library, which 
collects the history of the parallel execution of each branch of the parallel program in a 
special file. The parallel execution trace is represented as a partially ordered set of events. 
The absolute execution time of each branch and the times between parallel events in a 
branch are recorded. The special file also contains additional information about 
events (the size of the event, the source and the target of the event, etc.). The user can visualize 
the stored history traces by means of the “TraceVizualization” item of the menu. 
The “TraceVizualization” tool displays each parallel branch of an SPMD program as a 
horizontal line. Calls to the communication library are marked on this line by small 
squares. A time bar is displayed at the top of the image; it can be used to read off the time of a 
communication function call in milliseconds. 
Communication functions are represented in the image by integer numbers printed 
within the small squares of the diagram. For example, number one may represent the Init() 
function, and number two the Finalize() function. When the cursor is placed over a square with a function 
number, a hint appears. It contains the name of the communication 
function and the relative time of its execution. Some squares may be connected by 
green lines, which show that a branch of the parallel program has invoked a communication 
function and is waiting for an external event in order to complete this function. The 
length of the green line is proportional to the waiting time. When a button on the left side of 
the image is clicked, the idle time of each processor is calculated 
(in milliseconds) and shown on a diagram. To get the diagram for all processors, it is 
necessary to click the “All” button. The result of program profiling is stored 
automatically in system files that are accessible to users through the “TraceVizualization” tool. 



4 Example 

An example of the use of ParJava is a Java version of a parallel scalable C program designed 
for a homogeneous parallel computing system (Parsytec GC). The program was 
converted from C to Java without modification of the algorithm. Fig. 1 shows the speed-ups 
of program execution on a heterogeneous computer network consisting of Intel 
(Linux) and Sparc (Solaris) platforms for the following three cases: 1) ignoring the 
heterogeneity of the network (gray curve); 2) modeling an optimal homogeneous 
network (dotted curve); 3) launching one JavaVM on each computer (black curve). 




[Figure: speed-up plotted against network total output (1 = performance of SparcUltra1)] 

Fig. 1. Results of example executions. 



5 Related Works 

The possibility of performing effective parallel computations on heterogeneous 
computer networks is widely discussed in the periodical literature. The WINPAR system [4] is one of the 
successful attempts to solve this problem. It provides an integrated development 
environment for local area networks of personal computers running Windows NT, 
with message passing supported by MPI and PVM. It provides a set of tools for parallel 
program development, simulation, performance prediction, graphical high-level 
debugging, monitoring, and visualization. Though the system provided a very 
convenient user interface, it failed because of the ineffectiveness of heterogeneous 








parallel computations. There are several other Java environments for homogeneous 
and heterogeneous SPMD programming. We can mention the Towards system [5], which 
adds new language primitives, parallel arrays, to Java. This makes parallel Java 
programs more efficient but takes the system outside the Java framework. DOGMA [6] is a 
metacomputing environment that allows the use of heterogeneous clusters. A key 
feature of DOGMA is its ability to operate as an application server for parallel 
applications. The Dome project addresses the problems of load balancing in a 
heterogeneous multiuser environment, ease of programming, and fault tolerance [7]. 
mpC [8] is a high-level parallel language (a superset of ANSI C) designed specially for 
developing portable adaptable applications for heterogeneous networks of computers. 



6 Conclusion 

The development of sample SPMD programs in the ParJava environment showed that it is 
a suitable tool for effective, scalable parallel programming. ParJava is used for 
designing high-level object models of SPMD programming. The main advantage of ParJava 
is the possibility of executing a parallel program without any modification or 
transformation on various scalable computing systems (portability). 

The ParJava environment is still evolving. A full-scale debugger for parallel 
programs is being developed, and some new analyzers are being designed. 


Abstract. The architecture, system software, and programming technology of a neurocluster 
based on NM6403 neuroprocessors are discussed. Special attention is paid 
to the operating system structure, data and control flow between subsystems, 
internal data structures, system topology, the programming language, and general 
parallel programming ideas. 



Introduction 

The neurocluster based on the NM6403 is one part of a general heterogeneous network 
cluster. The system is developed for SPMD (Single Program Multiple Data) tasks, but it 
can also execute MIMD (Multiple Instruction Multiple Data) tasks. Such a 
system has to support data distribution, branch synchronization, and fast 
communication between subtasks [1]. This research is supported by the Russian 
Foundation for Basic Research (project No. 00-07-90300). 

The NM6403 microprocessor was developed by RT Module [2]. The current 
programming technology is inconvenient for parallel programming, mainly 
because it relies on a low-level language [3]. A high-level programming 
language, C++, is available, but it does not allow the use of vector instructions. The programmer has to 
write the code for parallel processing, the organization of communication, and other 
supporting code. 

Parallel computing systems need high-speed mechanisms for transferring data and 
system messages between processes. Every processor has two such links with a transfer 
rate of 20Mb/sec. The processors are connected in a ring topology using two rings with 
opposite transfer directions. As an alternative, we use a star topology with a 
communication device in the center. The communication device is based on TMS320C40 
signal processors and has memory that is accessible from the neuroprocessors connected to the 
device. A second communication mechanism is transfer over the PCI or Compact 
PCI bus, but it is not the fastest way because its speed depends on the number of processors. 

There are similar systems [4]. Philips produces the Lneuro chip, which consists of 16 
processor elements with 16-bit registers. Each processor element can work as 16 1-bit, 
8 2-bit, 4 4-bit, 2 8-bit, or 1 16-bit processor elements. 

Hitachi developed Wafer Scale Integration. Every wafer contains a neural network 
with 576 neurons. Each neuron has 64 8-bit weights. 

The NM6403 has a weight matrix of 32x64 bits. The weight bit length varies from 1 to 64. But 
this processor is not only a neurochip but also a DSP (Digital Signal Processor). 



1 Operating System 

The operating system has a modular architecture and can be logically separated into 

— Global task manager (GTM), which is executed on an Intel [5] 80x386 compatible 
microprocessor. There is only one copy of the GTM in the system. The GTM is the central 
control subsystem, which monitors and communicates with the other system modules. 

— Local task managers (LTM), which are executed on each NM6403 microprocessor. An LTM 
monitors the tasks executed on its local processor element and communicates with 
other LTMs and the GTM. 

— Remote operating system console. It serves as a communication and control tool 
between the user and the GTM. 



Global Task Manager 

Loading the GTM is the initial procedure for starting the operating system. The GTM 
provides the following services: system module initialization, control of and 
communication with other modules, collision detection, module unloading, and 
system shutdown. 

System module initialization includes GTM loading and system table creation 
(resource table, process table, message queue, etc.), LTM loading, and establishing a 
connection with the console. 

Messages can be of the following types: informational, control and packet 
messages. 



Local Task Manager 

The LTM provides services for user tasks (communication, computation, control, etc.). 
Another function is communication with the GTM and with the LTMs running on processors connected by 
links. The LTM switches between running processes to provide a multitasking 
environment in real-time mode. 



Dynamic Resource Management 

The programmer separates the code into branches when writing a parallel program. If the 
program is an SPMD task, then the total number of branches is defined. The programmer does not 
know how many processors and other resources will be available during program 
execution. The operating system has to provide the needed number of processors. 

If the system contains N processors, an SPMD task needs M processors, N<M, and the 
branches do not communicate with each other, the operating system executes N 
branches and places (M-N) branches in the execution queue. But if the branches interact 
intensively, then the branches should be executed in parallel. In this case some branches will 
wait for needed resources (e.g. synchronization or communication) and will be put to sleep by the 
operating system. Sleeping processes are placed in the execution queue. 

When a process is started, it is placed in the execution queue and its state is “starting”. If the 
system has an unused processor and there is only one waiting process in the execution queue, 
then the process will be loaded onto the free processor (the GTM sends the branch code and supporting 
data to the LTM). When the execution queue contains more waiting processes 
than there are free processors, the execution queue is analyzed. Every process has a 
priority value (the user can set these values for processes). These values are used to choose 
the next process from the queue for execution (see Table 1 for an example). 



Table 1. Example of Execution Queue 



Priority of waiting process 
Execution sequence 
Queue growing 





Processes are taken from the execution queue when computing resources are 
released or some process is put to sleep. 
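The priority-based choice of the next process can be sketched with a small queue model. It is only an illustration under stated assumptions: the class `ExecutionQueue` is hypothetical, a larger number is assumed to mean a higher priority, and processes of equal priority are assumed to be taken in arrival order.

```python
import heapq
import itertools


class ExecutionQueue:
    """Hypothetical model of the execution queue (not the actual OS code)."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker: arrival order

    def put(self, name, priority):
        # heapq is a min-heap, so the priority is negated to pop the
        # highest priority first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), name))

    def dispatch(self, free_processors):
        """Remove and return the processes to load onto free processors."""
        started = []
        while self._heap and len(started) < free_processors:
            _, _, name = heapq.heappop(self._heap)
            started.append(name)
        return started

    def __len__(self):
        return len(self._heap)
```

Calling `dispatch` whenever a processor is released or a process is put to sleep corresponds to the moments at which processes are taken from the queue.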



2 Programming Technology 

Programming Language 

The programmer can divide the code into branches, which will be executed on separate 
processors. Some branches can be executed in parallel. Others should wait for some 
event, depending on the logic of the task. The following construction shows how a branch can be 
declared. 

branch proc_type branch_name (parameter_list) { body } 

branch is a keyword that causes special code to be generated for loading, unloading, and 
communication. proc_type sets the type of microprocessor that is allowed to execute the 
branch; it can be nm for the NM6403. branch_name is a unique identifier of the branch. 
parameter_list is a list of parameters that are passed to the branch; body is the set 
of operators and commands. 

When the OS loads a branch, a process is created, which consists of a set of segments: code 
segment, data segments, stack segment, etc. The parameters passed to the branch and the results of the 
calculation are pushed onto the stack. When the process terminates, the result is returned to the main 
program. 

Parameters and results can also be passed with the help of MPI [6] (Message Passing 
Interface) or MPI-RT [7] (Message Passing Interface - Real Time). 

A branch can be loaded by the branchstart routine. This function is declared as 
follows: 

handle branchstart (branch, num_param, param_list); 

branch is the identifier of the branch, num_param is the number of parameters, and 
param_list is the list of parameters. The function branchstart returns the process 
descriptor (ID). 

The branchwait routine is used for barrier synchronization. The parameter of this 
function is a list of process descriptors terminated by 0 (null). 

The function branchkill terminates the execution of a process. The process descriptor is 
passed as a parameter. 

To run an SPMD task, which consists of num_branches branches with the 
same code and different data, the following function is used: 

handle SPMD_start (num_branches, branch_name, 
num_param, list_param [dist]); 

branch_name is the branch identifier, num_param is the number of parameters, 
list_param is the list of parameters, and dist is the kind of data distribution. The function returns 
the descriptor of the SPMD task. 

For barrier synchronization with the termination of an SPMD task, the following routine is 
used: 

SPMD_wait (handle ID); 

ID is the descriptor of the SPMD task. 
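The start/wait pattern of SPMD_start and SPMD_wait can be mimicked in Python as a rough analogy. The functions `spmd_start` and `spmd_wait` below are hypothetical stand-ins that use threads on one machine; they are not the routines of the described system and do not model loading branches onto NM6403 processors.

```python
from concurrent.futures import ThreadPoolExecutor


def spmd_start(num_branches, branch, data_parts):
    """Start `num_branches` copies of `branch`, one per data part.

    Returns a handle (here simply the pool and its futures), which plays
    the role of the SPMD task descriptor.
    """
    if len(data_parts) != num_branches:
        raise ValueError("one data part per branch is required")
    pool = ThreadPoolExecutor(max_workers=num_branches)
    futures = [pool.submit(branch, part) for part in data_parts]
    return pool, futures


def spmd_wait(handle):
    """Barrier: block until every branch has finished, return the results."""
    pool, futures = handle
    results = [f.result() for f in futures]
    pool.shutdown()
    return results
```

For instance, `spmd_wait(spmd_start(3, sum, [[1, 2], [3, 4], [5]]))` runs the same code (`sum`) on three different data parts and collects the three results in branch order.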



Types of Data Distribution 

The branches of an SPMD task share data, so a method of 
distributing data between the branches has to be set. There are the following types of data distribution: 

1. The cyclic type distributes data items between the branches one by one (see Table 2). 

2. The block type distributes data between the branches by dividing the original data into 
equal blocks. The number of blocks is the same as the number of branches. 

3. The block-cyclic type is similar to the cyclic distribution, but data are distributed not 
item by item but block by block. 
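The three distribution types can be sketched on a plain list of items. The function names are illustrative, and the handling of a data size that is not divisible by the number of branches (the last block is shorter) is an assumption that the text does not specify.

```python
def cyclic(items, branches):
    """Deal items to branches one by one: item k goes to branch k % branches."""
    return [items[b::branches] for b in range(branches)]


def block(items, branches):
    """Give each branch one contiguous block of (nearly) equal size."""
    size = -(-len(items) // branches)  # ceiling division
    return [items[b * size:(b + 1) * size] for b in range(branches)]


def block_cyclic(items, branches, block_size):
    """Deal blocks of `block_size` items to branches in round-robin order."""
    parts = [[] for _ in range(branches)]
    for start in range(0, len(items), block_size):
        owner = (start // block_size) % branches
        parts[owner].extend(items[start:start + block_size])
    return parts
```

For twelve items and three branches, `cyclic` gives branch 1 the items 0, 3, 6, 9, `block` gives it the items 0 to 3, and `block_cyclic` with blocks of two gives it the items 0, 1, 6, 7.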



Table 2. Example of cyclic data distribution between 6 branches 



Branch 1    Branch 2    Branch 3    Branch 4    Branch 5    Branch 6 


Commutation 

During the first stage of OS startup, the system configuration is detected. The OS loader 
sends a test message to the first processor that it finds. When a processor receives the 
message, it “remembers” the sender and sends the test message over its own links to the 
other processors. This procedure is repeated as long as there is at least one processor 
that has not received the message. Then the message is returned, and every device 
appends its own identifier and link number to the end of the message. The received sequence is 
analyzed to build the topology of the system. 
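The detection procedure resembles a flood over the links, which can be simulated as follows. This is a simplified sketch: a sequential breadth-first traversal stands in for the real message exchange, and the data layout (`links[p][k]` is the processor reached from `p` over its link number `k`) is an assumption made for the example.

```python
from collections import deque


def discover(links, first):
    """Return, for every reachable processor, the route from `first`.

    A route is a list of (device identifier, link number) pairs, i.e. the
    kind of sequence that the returned test message would carry.
    """
    sender = {first: None}  # processor -> (remembered sender, link used)
    queue = deque([first])
    while queue:
        p = queue.popleft()
        for link_no, q in enumerate(links[p]):
            if q is not None and q not in sender:
                sender[q] = (p, link_no)  # q "remembers" who reached it
                queue.append(q)
    routes = {}
    for p in sender:
        route, cur = [], p
        while sender[cur] is not None:
            prev, link_no = sender[cur]
            route.append((prev, link_no))
            cur = prev
        routes[p] = route[::-1]
    return routes
```

For a ring of four processors in which link 0 points clockwise and link 1 counter-clockwise, the routes show that processor 2 is reached through link 0 of processor 0 and then link 0 of processor 1.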





















Mechanisms of Parallel Computing Organization for NeuroCluster 185 



When there is no special commutation device, communication is 
performed through the host computer: a message is sent to the host computer, 
which then sends it to the target processor. This method of 
communication is slow and can be used when communication is rare. 

If there is a special commutation device in the system, then all messages are sent to 
that device and then on to the target processor or to another commutation device. 

When a program is compiled, the compiler checks for requests to data located in the 
RAM of another processor and, for each such request, adds code for accessing that data. 
The host computer (or the commutation device, if it is present) has a table of data locations, which 
is used when searching for data. 



Conclusion 

The main operating system modules have been coded and tested. These modules are the Global 
Task Manager, the Local Task Manager, and the system console. The programming 
technology is ready. 

The next part of the work will include the development of communication libraries, system 
drivers, C and Fortran [8] compilers, system tools, and applications such as image 
processing, image recognition, and map processing (e.g. fingerprint recognition). 


Abstract. A language for describing the informational-control structure of a task graph 
is proposed. The nodes of the task graph can represent SPMD applications 
in a network cluster. The graph supports branching, cycles, and parametric 
adjustment. The organization of dynamic cluster resource 
management during the execution of the task graph is considered. An 
example of a task graph description, demonstrating the basic capabilities 
of the language and the system, is given. 



Introduction and Problem Statement 

Much software for network cluster systems is known: MPI, PVM, mpC [6], 
Linda [6], T-system [1], Condor [4], and many others. This software offers many mechanisms for 
synchronization and communication between parallel processes. The 
systems known to the authors usually either do not offer dynamic resource distribution 
or impose some restrictions on it. Dynamic organization of parallel computations makes it possible to 
optimize the utilization of system resources and provides a high level of reliability and virtually 
infinite resources. It also provides additional security in the network. 

In our works [2,3,5], the security problems of a cluster network system, the implementation of a system for launching remote 
subtasks, and the synchronization of subtasks and data exchange 
between them were considered. This paper is devoted to a task graph description 
language, which sets the sequence of subtask execution and the data exchanges between 
subtasks. The language also makes it easy to describe SPMD tasks. 

MPI, PVM, and mpC are based on the Unix concept of processes that exist simultaneously 
and have identical code with several branches. The programming technology coincides 
with that of sequential programming, which makes complex parallel tasks hard to 
program. 

The T-system, Linda, and the Condor cluster system are closest to the proposed development. 

The T-system provides automatic dynamic parallelization of programs. It is a 
software system that uses dynamic distribution of tasks (functions). The T-system applies the 
functional programming principle. Functional programming imposes 
some restrictions on the way programs are written (for example, each function has 
several inputs and only one output). The T-system works on Unix-compatible OSs. 

Linda is similar to the T-system; it is based on the shared memory ideology and has no 
suitable mechanism for dynamic resource allocation [6]. 

Condor is a system for organizing dynamic distributed computations. It provides a 
task migration mechanism in case of machine failure, and dynamic task creation. 
Condor does not allow, or allows only with restrictions, the task structure to be set: the sequence of 
execution, task synchronization, and data exchanges. 



Task Configuration Language 



The tasks in the system are described by means of a configuration language, which 
sets a graph of informational links inside a task. A task consists of subtasks, 
or graph nodes. Each subtask is an executable file obtained by means of traditional 
sequential programming (in C, for example); it is a "black box" with 
input and output links. 

There are two basic kinds of links between subtasks: after-death links (active after task completion) and 
lifetime links (used during task processing). The links are unique system objects. The data 
transmitted via links have no type and are represented as a raw sequence of bytes. 

The task structure is defined by its data links. This is one of the main novelties of the 
system. The links inside a task exist for as long as the task is being processed. They are 
independent of the nodes, which are processed and then finish. Links and subtask 
readiness functions define the conditions under which subtasks have to be started. 

When it is necessary to start a subtask, the system chooses the least loaded 
workstation in the cluster and sends it a command to start the executable file of the 
subtask. 

Readiness functions are Boolean functions over combinations of subtask 
inputs. A subtask input is in the true state when it contains data. If the readiness function of 
some subtask is true and the subtask is not being processed, the system starts this subtask. All 
subtasks have readiness functions. If no readiness function is defined for a 
certain subtask in the task configuration script, then its readiness function is assumed to be 
the AND over all of the subtask's inputs. 
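The readiness rule can be sketched in a few lines. The class `Subtask` and the function `default_ready` are hypothetical names used only for illustration; the sketch models the start condition and leaves out links, data transfer, and workstation selection.

```python
def default_ready(inputs):
    """Default readiness function: AND over all inputs of the subtask."""
    return all(inputs.values())


class Subtask:
    """Hypothetical model of a graph node with a readiness function."""

    def __init__(self, name, input_names, fready=None):
        self.name = name
        # An input is True when it contains data.
        self.inputs = {n: False for n in input_names}
        self.fready = fready or default_ready
        self.processing = False

    def should_start(self):
        # Start only if the readiness function is true and the subtask
        # is not being processed already.
        return (not self.processing) and self.fready(self.inputs)
```

A node that takes part in a cycle can be given an OR function, for example `Subtask("s2", ["ipar21", "ipar22"], fready=lambda i: i["ipar21"] or i["ipar22"])`, so that data on either input is enough to start it.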

The text below describes the graph shown in Fig. 1 and demonstrates the main 
capabilities of the language. 

Fig. 1 

//Task name and its input and output parameters: 
Task "sample" in tifile1, tifile2 out tofile; 
const { N = 4; } //Constant description and definition 
//Task input and output data description: 
data { tifile1 "file path"; tifile2 "file path"; 
tofile "file path"; } 
//Subtasks types description: 
node_types { stsk1 "executable file path" in ipar11, ipar12 out opar11; 
stsk2 "executable file path" in ipar21, ipar22 out opar21, opar22; 
stsk3 "executable file path" in ipar31 out opar31; 
stsk4 "executable file path" in ipar41, ipar42 out opar41, opar42; 
stsk5 "executable file path" in ipar51 out opar51 live lpar51; 
stsk6 "executable file path" in ipar61 out opar61; } 
//Graph description: 
graph { stsk1 s1; //Nodes variables definition 
stsk2 s2; stsk3 s3; stsk4 s4; stsk5 stsk5m[N]; stsk6 s6; 
//Readiness functions definition: 
fready s2 (ipar21 | ipar22); fready s4 (ipar41 | ipar42); 
//Connecting links: 
connect tifile1, s1.ipar11; connect tifile2, s1.ipar12; 
connect s1.opar11, s2.ipar21; 
connect s1.opar11, s3.ipar31; 
connect s2.opar21, s4.ipar41; 
connect s4.opar42, s2.ipar22; //Making a cycle 
//Merge a number of links in one link: 
merge s4.ipar42 s2.opar22, s3.opar31; link t[N]; 
//Distribute data from one link on array of links: 
dist s4.opar41 (cyclic(100) on t); 
rep i=0,N-2 //Replicate part of text 
{ connect t[i], stsk5m[i].ipar51; 
connect stsk5m[i].lpar51, stsk5m[i+1].lpar51; } 
merge s6.ipar61 (stsk5m[i], i=0,N-1); 
connect s6.opar61, tofile; } 

Now we give some comments on the text above. The task that it describes 
starts in node 1 and finishes in node 6. The task takes data from the files defined in the 
data section. This section also defines where the output data will be put when the task 
completes. Then we define the node types, our “black boxes”. They are just types, not 
real nodes! Real nodes are defined in the graph section as variables of the types defined 
above. 

For node 2 and node 4 we have to define readiness functions. These nodes can 
form a cycle, so the usual AND readiness function cannot be applied in this case, 
because data can be received from several sources. If node 2 starts with data in 
ipar22, it is a cycle; with data in ipar21, it is the direct flow of execution. By analogy, if node 4 
starts with data in ipar41, it is a cycle; with data in ipar42, it is the direct flow of execution. Thus, 
to organize a cycle we need to use readiness functions combined with merge. Another 
way to do this is to use an additional node that works like merge and unites data 
from several links into one output link. 

The language has several constructions that make it easy to create SPMD tasks. They 
are: merges, which unite data from several links into one link; distributions, which cut 
the data from one link into slices and then put these slices into the distribution's output links; and 
arrays of nodes, which replicate one subtask. Thus, to create an SPMD task, we need to 
declare an array of nodes, distribute the data between these nodes using dist, and 
merge the output data using merge after all the subtasks have completed. 



Conclusion 

In this article, a language was described that allows the task structure to be specified as a graph with 
informational links, admitting branching and cycles; an example of a 
graph description was considered. The graph nodes are executable files obtained by means 
of traditional sequential programming (in C, for example). 

Representing a task in the form of the information-control graphs considered above 
can be useful (through the virtualization of resources and increased system reliability) in 
solving complex tasks such as the management of adaptive mobile objects in a 
non-stationary environment, where complex computational algorithms that depend on the 
movement mode and the environment conditions are applied. Dynamic task 
distribution also provides additional security for a cluster system that works in networks 
with public access. 

At this time, the subsystems for security, remote starting, and task state monitoring 
with graph handling have been partially implemented. Further implementation 
of the system is planned for the near future. 


Abstract. In the framework of distributed object systems, this paper 
presents the concepts and an implementation of a mechanism for overlapping 
communication and computation. This mechanism makes it possible 
to decrease the execution time of a remote method invocation with 
parameters of large size. Its implementation and related experiments in the 
C++// language running on top of Globus and Nexus are described. 

Keywords: Distributed Objects, C++, Metacomputing, Nexus/Globus, 
Lightweight Process, Remote Method Invocation, Pipelining, Future, 
Overlapping communication and computation 



1 Introduction 

1.1 General Objective 

Distributed supercomputing applications require large amounts of computational 
resources that often only computational grids environments can provide. The 
price to pay when executing on such environments is the mandatory use of 
a high latency, low throughput network. As a consequence, any solution that 
could help to lower communication costs would be worth considering. 

A basic idea is to overlap communication with computation, thus yielding
a pipeline effect for message transmission. Any attempt to exploit this
opportunity needs to rely on non-blocking elementary communications, such as
the asynchronous send and receive primitives provided by well-known
message-passing libraries (e.g. PVM [11] or MPI [15]).

For code readability and portability purposes, one additional requirement is
to make the use of the overlapping technique as transparent as possible for
programmers. As such, we reject hand-programmed distributed solutions where
the programmer would himself split the data to be sent into smaller pieces
and asynchronously send each piece in turn, thus “feeding” the pipeline, while
the receiver side would explicitly and repeatedly receive each new piece and
proceed with the related computation.

Previous attempts to automatically make use of an overlapping mechanism 
between communication and computation have been successful in the context 
of data-parallel compiled languages. But as far as we know, this idea has never 
been investigated in the area of distributed object-oriented languages. 
1.2 Formulation of the Problem

The general idea behind overlapping is that, during a remote computation on
large data that must be transmitted, communication and computation are
automatically split into steps that each handle a smaller data volume. It is
then only a question of pipelining these steps so that the current step of the
remote computation overlaps with the data transmission for the next step.
This requires executing a computation step and a transmission step at the
same time. One way to achieve this is to use non-blocking communications.
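
As an illustration only, the pipelining of steps described above can be
sketched in standard C++. This is not C++// code and does not use Nexus: the
class ChunkChannel and the function pipelined_sum are hypothetical names, and
the network is simulated by an in-process queue filled by a sender thread.
While the receiver computes on one chunk, the sender is already delivering the
next one.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <numeric>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical in-process stand-in for a network channel: the sender thread
// pushes chunks, the receiver pops them as they become available.
class ChunkChannel {
public:
    void send(std::vector<int> chunk) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            chunks_.push(std::move(chunk));
        }
        ready_.notify_one();
    }
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_one();
    }
    // Blocks until a chunk is available; returns false once the channel is
    // closed and fully drained.
    bool receive(std::vector<int>& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !chunks_.empty() || closed_; });
        if (chunks_.empty()) return false;
        out = std::move(chunks_.front());
        chunks_.pop();
        return true;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::vector<int>> chunks_;
    bool closed_ = false;
};

// Sums `data` in a pipelined fashion: while the receiver computes on one
// chunk, the sender thread is already "transmitting" the following ones.
long long pipelined_sum(const std::vector<int>& data, std::size_t chunk_size) {
    assert(chunk_size > 0);
    ChunkChannel channel;
    std::thread sender([&] {
        for (std::size_t i = 0; i < data.size(); i += chunk_size) {
            std::size_t end = std::min(i + chunk_size, data.size());
            channel.send(std::vector<int>(data.begin() + i, data.begin() + end));
        }
        channel.close();
    });
    long long total = 0;
    std::vector<int> chunk;
    while (channel.receive(chunk)) {
        // Computation step on the chunk that has just arrived.
        total = std::accumulate(chunk.begin(), chunk.end(), total);
    }
    sender.join();
    return total;
}
```

Calling pipelined_sum on a vector with a chunk size of 3, for example, sends
the data in pieces of at most three integers. The explicit splitting shown
here is exactly what the technique of this paper aims to hide from the
programmer.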

Schematically, in the SPMD or SIMD programming models, a similar computation
has to be executed on each element of a large but fixed-size data structure.
The compiler or the run-time system can therefore quite easily split it into
small pieces, send each one in turn, and apply the computation to each piece
once it is received. If the compiler or the run-time system is not able to
decide automatically how to split the data, the programmer can help. Thus, the
implementation of this technique has generally been restricted to the field of
data-parallel languages for parallel architectures with distributed memory,
such as HPF [3] and FortranD [17], but it also appears in LOCCS [8], a library
for communication routines and computation.

But how should the same problem be tackled in the area of distributed
object-oriented languages? In this context, the whole computation taking place
on the distributed entities can be expressed as remote service invocations
through method calls, such as RMI [16] in Java or RPC in C/C++ [2], even if
very low-level communications, e.g., network communications, are ultimately
used. In order to exhibit parallelism between distributed computations, one
solution is to use asynchronous (non-blocking) service invocations instead of
the blocking ones featured by classical RPCs. Many models and languages have
exploited this idea [4]. In particular, we have designed and implemented
distributed extensions to object-oriented languages such as Eiffel, C++ and
Java that enforce sequential code reuse in a parallel and distributed
setting [6,7]. In such language extensions, each service invocation can be
executed in parallel with the on-going computation. Once the result of the
service is required, a wait-by-necessity mechanism comes into play [5]. More
information related to this model is given in Sect. 3.
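
The combination of a non-blocking invocation and wait-by-necessity can be
approximated in standard C++ with std::async and std::future. The sketch below
is only an analogy: sum_service and invoke_and_overlap are hypothetical names,
the "remote" service is a local function, and the blocking point (get) is
explicit here, whereas in C++// it is transparent.

```cpp
#include <future>
#include <numeric>
#include <vector>

// Hypothetical "remote" service: here simply a local function.
long long sum_service(std::vector<int> values) {
    return std::accumulate(values.begin(), values.end(), 0LL);
}

long long invoke_and_overlap(const std::vector<int>& a,
                             const std::vector<int>& b) {
    // Non-blocking invocation: the call returns at once with a future.
    std::future<long long> reply =
        std::async(std::launch::async, sum_service, a);
    // The caller goes on with its own computation in the meantime.
    long long local = std::accumulate(b.begin(), b.end(), 0LL);
    // Wait-by-necessity: block only when the result is actually needed.
    return local + reply.get();
}
```

With a = {1, 2, 3} and b = {10, 20}, the caller sums b while the service sums
a, and only waits for the reply when the two partial results are combined.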

In the implementation of such remote method invocation-based settings, all 
arguments of the method call must generally be received before the method 
execution starts. 

Main idea. The essence of our proposition is thus to apply a classical
pipelining idea to the arguments of a remote call: once the first part of the
arguments has arrived, the method execution can start. Moreover, the type of
the arguments alone automatically indicates how to split the data to send. In
this way, programmers can express, at a very high level, opportunities to
overlap communications with computation. Optimisation of the parameter
copying process, as in [18], is a different but complementary approach.
As the rest of the paper will show, the way the technique is designed and
implemented makes it easy and flexible for programmers to use and, in some
circumstances, yields remarkable performance gains on a LAN-based environment
as well as on a WAN-based one (i.e. on a grid).

1.3 Design Guidelines 

To implement this general idea, several problems have to be solved: 

1. design and implement elementary mechanisms, such as: data splitting, com- 
putation steps that can deal with partial data, . . . ; 

2. make the mechanism as transparent as possible for programmers, but
give them the possibility to guide the data splitting;

3. try to determine the appropriate size for data packets (i.e. try to estimate 
the duration of the different steps). 

Our contribution is to design, implement, and evaluate this mechanism within
the context of an object-oriented language extended with mechanisms for
parallelism and distribution, C++// [6]. Only points 1 and 2 are resolved in
this paper. Automatically solving point 3 would require more precise
information about the computation and the underlying communication
performance (a strategy for data-parallel languages running on dedicated
parallel machines is developed, for instance, in [8]).

As communication performance in the context of grid computing is quite
unpredictable and varies dynamically, point 3 would essentially have to be
solved manually (even if the programmer could be helped by some performance
measurement tool), knowing that the benefits of the overlapping would also
vary dynamically. Since slicing the data into smaller units, and slicing the
computations correspondingly, apparently has to be done manually by the
programmer, our solution can help: it provides a way to describe and try to
take advantage of pipelining in distributed object-oriented applications that
is easy in terms of programming effort and cheap in terms of running cost.

Structure of the paper. In Sect. 2, requirements and steps for point 1 are
discussed. Then, strategies for splitting requests (point 2) are presented.
Section 3 introduces an implementation of this technique using the C++//
language, whose runtime is based on both standard and lightweight processes
(through Nexus and Globus). In Sect. 4, we present some benchmarks whose main
purpose is to validate the technique and its implementation while exhibiting
some cases where the gain is almost optimal. This work is an extension of [1]
in the sense that, except for the design, the implementation, experiments,
analysis and lessons learned are new, as they arise in a broader context
(multithreading plus metacomputing).

2 Communication/Computation Overlap 

This section presents the overlapping technique and the requirements for its 
implementation. 
2.1 Elementary Mechanisms

The following items are the building blocks of the technique: 

— send a request in pieces (without taking into account the strategy used for 
splitting it); 

— be able to rebuild a partial request in such a way that service execution can 
be started; 

— be able to integrate missing data when it arrives even if service execution 
has started; 

— be able to block the computation if it tries to use missing data. 

Step for Request Creation. In every system that proposes an RPC mechanism, 
the remote service request has to contain the method ID and the different pa- 
rameters of the call which are marshalled using a deep (i.e. recursive) copy of 
the objects graph^. After that, the request is sent. 

Requirement 1. Have access to the runtime code that sends requests in order
to be able to decide when to send a request piece.

Step for Request Rebuilding. Once arrived in the remote system, the request is 
rebuilt: each parameter is reconstructed with the corresponding data and then 
the service can start. For implementing the overlapping technique, we have to 
be able to put a mark for the missing data. This mark informs the service that 
data are, temporarily, unavailable. 

Requirement 2. Have access to the runtime code that deals with the unmar- 
shalling of the request in order to manage marks of missing pieces. 

When the remote context receives a new part of a request that is already 
partially rebuilt, the context has to be able to deal with it in an automatic and 
transparent way regarding the service that is already executing. 

Requirement 3. A mechanism that receives and manages messages transpar- 
ently. 

Step for Service Execution. The service can run without any problem as long
as it does not attempt to access missing data. An automatic and transparent
blocking mechanism is required when it tries to use missing data. In the same
way, resumption has to be transparent and automatic. This requires a
wait-by-necessity mechanism [5]. Such a mechanism is provided by the classical
future mechanism as originally designed in Multilisp [12].

Requirement 4. Future types available from the programming language. 

Assuming the previous requirement is fulfilled, each piece of data that is
missing at the instantiation time of the request object is replaced with a
data type having future semantics.
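
A placeholder with such future semantics could look like the following
minimal sketch in standard C++. The names LaterSlot, square_then_add and
demo_later_slot are hypothetical and do not come from the C++// runtime; the
sketch assumes a default-constructible value type and uses a plain thread to
play the role of the receiving side. Access blocks until the missing data has
arrived, and the update wakes up the blocked computation.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical placeholder for a request parameter that may still be missing.
template <typename T>
class LaterSlot {
public:
    // Called by the receiving side when the missing piece arrives.
    void update(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            value_ = std::move(value);
            has_value_ = true;
        }
        arrived_.notify_all();
    }
    // Called by the service: blocks until the value is available.
    T get() {
        std::unique_lock<std::mutex> lock(mutex_);
        arrived_.wait(lock, [this] { return has_value_; });
        return value_;
    }
private:
    std::mutex mutex_;
    std::condition_variable arrived_;
    T value_{};
    bool has_value_ = false;
};

// The first part of the service uses only x; the second part needs y.
int square_then_add(int x, LaterSlot<int>& y) {
    int squared = x * x;       // can run while y is still in transit
    return squared + y.get();  // blocks only if y has not arrived yet
}

// Small driver: a second thread delivers the missing value.
int demo_later_slot() {
    LaterSlot<int> y;
    std::thread receiver([&y] { y.update(5); });
    int result = square_then_add(3, y);
    receiver.join();
    return result;
}
```

In this sketch the blocking call to get is visible in the service code; the
requirement stated above is that the language makes this step transparent.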

^ If a field of an object is a reference to a remote object, i.e. a proxy, we just flatten 
a copy of this proxy. 
2.2 Strategies for Splitting a Request

This section deals with point 2 mentioned in the introduction. The crucial
idea is to break up the request parameters in the way that is most transparent
for programmers. This requires a modification of the marshalling/unmarshalling
routines of objects. Whether these routines are generic or not, we have to be
able to overload them.

Requirement 5. Be able to change the default marshalling/unmarshalling rou- 
tines. 

Strategies can be split into two groups, according to whether or not they
modify the class of the objects involved in a request.



With Class Modification. A new class called later is introduced, from which
all objects that need to be sent later have to inherit (see Code 1 for an
example). Objects of these classes must not be sent (and possibly not even
marshalled) during the first inspection of the objects belonging to the
request, but later, each one in a new message (as would be done for m2 when
calling dom->rang(m1, m2) in Code 2, for example). According to the previous
requirements, later objects behave the same as future objects: automatic
blocking when one tries to access the value, and transparent update of the
object with the incoming value.

This technique applies whether objects of later type sit at the first level
(i.e. they are parameters of the remote call, as m2 in Code 2) or at lower
levels (i.e. they are parts of non-later parameters; for example, each line of
a matrix could be declared later whereas the matrix itself is not). Notice
that, if needed, it is possible to cast an object declared as inheriting from
later to the original type (e.g. from Matrix_Later to Matrix), and vice versa.
For example, if a later object must be used at the very beginning of the next
remote call, it would be worth casting it to its original type in order to
send it immediately.

Code 1 (Definition of a later class). 

class Matrix_Later :
    public Later, public Matrix {
    ...
};



Without Class Modification. Two kinds of strategies come to mind: 

1. either a new routine could replace the one used by default by the language 
runtime in order to flatten the objects graph corresponding to a request. The 
new routine would split the graph, each obtained part being subsequently 
sent in a new message. Splitting strategies could rely on the algorithm used 
for traversing the graph (either breadth or depth first); 
2. or, if the language allows a class member function to be used for the
flatten operation instead of the standard one, a class could define its own
customised flatten-splitting routine, in the same spirit as when defining
derived datatypes in MPI. For example, assuming one parameter of the request
is an instance of a Matrix class, the flatten routine overriding the default
one could specify independently for each line how to flatten it and when to
send it.

Considering the first strategy implies that all arguments that need to be
marshalled are split, whatever the potential benefit, i.e. without taking into
account, for instance, the order in which those arguments are used. The second
strategy, while not transparent, gives the opportunity to specify a more
adequate splitting and even sending order. As such, one can consider that
these two strategies lie at the two extremes of the spectrum, while the one
using later types lies in between. Indeed, casting an object to later or back
to its original type, and being careful about argument positions in method
signatures, is a satisfactory compromise: it is not completely transparent for
programmers, who thus have some control over the splitting, but it does not
require defining a specific marshalling routine for each type, which would be
quite tedious. So, we decided to experiment only with the strategy that relies
on later types.



3 Prototype Environment 

We briefly present in this section our implementation of the overlapping mecha- 
nism. We use for this a parallel and distributed extension of C++, called C++//, 
whose runtime is based on communicating lightweight processes using the Nexus 
library and Globus [9]. 

3.1 C++// 

The C++// language [6] (http://www.inria.fr/oasis/c++ll/) was designed 
and implemented with the aim of importing reuse into parallel and concurrent 
programming. It does not extend the language syntax, and requires no modi- 
fication of the C++ compiler, since C++// is implemented as a library and a 
preprocessor (relying on a Meta-Object Protocol [13] - MOP). 

C++// provides a heterogeneous model with both passive and active objects.
Active objects act as sequential processes serving requests (i.e. method
invocations) in a centralized and explicit manner by default (such objects are
instances of subclasses of the specific C++// class Process). Communications
towards active objects are systematically asynchronous. There are no shared
passive objects (only call-by-value between processes, which implies making
deep copies of request/reply parameters, like serialization in Java RMI).

The MOP is centered around points concerning RPC, where some reification
is applied: request send, request receive, reply send, reply receive. These
points manipulate requests or replies as first-class objects. Generic flatten
and rebuild functions are used for these objects. The reply of a service
invocation is transparently built as a future. Access through method
invocation to any object of future type is reified and blocks the caller if
the result is not back yet.

Part of C++// runtime based on Nexus. Nexus [10] is a library used for both 
communications and lightweight processes (threads) in distributed applications, 
which provides the notion of remote service execution. 

A C++// active object is implemented by using a lightweight process on a
possibly remote Nexus context. The request queue of an active C++// object can
be remotely referenced thanks to the definition of a Nexus global pointer. A
request for an active object is remotely queued by invoking a remote service
at the Nexus level (named Queue_a_request). This service takes as arguments
a C++// service request id and a list of C++// objects as parameters. Each
such object is flattened using the generic flat() method of C++//. Executing
a Queue_a_request service implies launching a new thread whose code is the
effective queuing of the C++// request in the queue of the target C++//
object, after its parameters have been unmarshalled (the generic C++// build()
method is used for this purpose). Concurrency between request queue filling
and request queue extracting is managed with Nexus local mutual exclusion
primitives.

Part of C++// runtime based on Globus. C++// relies on the GRAM mechanism [9]
to acquire nodes on a remote host and allocate active objects on a new
machine. To help the programmer in this task, C++// provides a simple file to
specify the mapping. For example:

m0 ll.inria.fr GLOBUS /0/sloop2/dsagnol/ecll/tests/gtk2/sc99Demo_slave
m1 pitcairn.mcs.anl.gov GLOBUS /nfs/dsl-homes02/caromel/sc99Demo_slave
m2 das3fs.tn.tudelft.nl GLOBUS /home/caromel/sc99Demo_slave
m3 bolas.isi.edu GLOBUS /nfs/v6/caromel/sc99Demo_slave

The strings m0 ... m3 are the virtual names of the machines that we use in the
program. With this mechanism, we can change the mapping of the application
without recompiling it.
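
To make the role of such a file concrete, the sketch below reads a mapping of
this shape into a table indexed by virtual machine name. It is an illustration
in standard C++, not the C++// loader: the names NodeMapping and parse_mapping
are hypothetical, and the interpretation of the four columns (virtual name,
host, launch mechanism, executable path) is an assumption based on the example
above.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical record for one line of the mapping file. The meaning of the
// columns is assumed from the example, not taken from C++// documentation.
struct NodeMapping {
    std::string host;        // e.g. "pitcairn.mcs.anl.gov"
    std::string launcher;    // e.g. "GLOBUS"
    std::string executable;  // path of the program started on that host
};

// Parses whitespace-separated lines "name host launcher executable" and
// returns them indexed by virtual name. Incomplete lines are skipped.
std::map<std::string, NodeMapping> parse_mapping(const std::string& text) {
    std::map<std::string, NodeMapping> mapping;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string name;
        NodeMapping node;
        if (fields >> name >> node.host >> node.launcher >> node.executable) {
            mapping[name] = node;
        }
    }
    return mapping;
}
```

Because the program only refers to the virtual names, changing a host or a
path in the file changes the placement without touching the compiled code.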

3.2 Implementation of the Overlapping Technique in C++//

At the MOP level, the main modification is to write a new generic function
to flatten requests (see Requirement 5): this function builds a first fragment
which holds the request header and the non-later parameters, and then one
fragment for each parameter of later type. Then, at the runtime level (see
Requirement 1), the Queue_a_request service is remotely called for the first
fragment, while a newly defined service, Update_Later, is called for the
remaining fragments. The Queue_a_request service has been slightly modified in
order to manage marks for missing objects (see Requirement 2). The newly
defined service Update_Later transparently updates the corresponding awaited
request parameters (see Requirement 3). As seen here, implementing the
overlapping technique requires only minor modifications in the C++// language
runtime support.
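
The fragmentation performed by the new flatten function can be pictured with
the following simplified sketch in standard C++. It is not the C++// runtime
code: Param, Fragment and flatten_request are hypothetical names, parameters
are modelled as already flattened strings, and the "|" separator and the
"<missing>" mark are arbitrary choices made for the illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified model of a request parameter: its flattened form
// plus a flag telling whether its class inherits from the later class.
struct Param {
    std::string flattened;
    bool is_later;
};

// One message on the wire. `param_index` is meaningful only for later
// fragments: it tells the receiver which awaited parameter to update.
struct Fragment {
    bool is_first;
    std::size_t param_index;
    std::string payload;
};

// Builds the first fragment (header, non-later parameters, and a mark for
// each missing parameter), then one fragment per later parameter.
std::vector<Fragment> flatten_request(const std::string& method_id,
                                      const std::vector<Param>& params) {
    std::vector<Fragment> fragments;
    Fragment first{true, 0, method_id};
    for (std::size_t i = 0; i < params.size(); ++i) {
        if (params[i].is_later) {
            first.payload += "|<missing>";  // mark for data that will follow
        } else {
            first.payload += "|" + params[i].flattened;
        }
    }
    fragments.push_back(first);
    for (std::size_t i = 0; i < params.size(); ++i) {
        if (params[i].is_later) {
            fragments.push_back(Fragment{false, i, params[i].flattened});
        }
    }
    return fragments;
}
```

In terms of the services named above, the first element of the returned
vector would travel with Queue_a_request and each remaining element with
Update_Later.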
In order to switch from a later type to the original type and vice versa,
the MOP of C++// provides two primitives. Being of type later implies being
accessed through a proxy; casting to the original type means discarding the
proxy and returning a pointer to the original object.



4 Validation 

4.1 Benchmark 

We designed a simple test and benchmarked it. This test must not be considered 
as a real application, but as a means to validate the effectiveness of the technique. 

Program. The test is based on the remote call of the method OpMatrix::rang()
(see Code 2), which takes two matrices, squares the first one, and adds the
second one. As the second matrix m2 is of type Matrix_Later, it can be used as
a parameter of OpMatrix::rang(). The remote service can start as soon as the
request id and the non-later parameters have been received. Experiments not
using the overlapping technique are easily conducted: define m2 as an instance
of Matrix instead of Matrix_Later.




The technique should make it possible to overlap the remote execution
requiring only m1 (i.e. the method m1->square()) with the transmission and
reception of the later parameter (i.e. the matrix m2), which is only needed
for the second part of the service execution (i.e. m2->plus(m1)). Compared
with an execution not using the overlapping technique, the duration of
m1->square() should increase, since, at the same time, the remote processor
also has to manage the reception and update of the matrix m2.

In the framework of this test, we measure various durations (see Fig. 1). The
first, total_duration, is the total duration of the complete call as perceived
by the caller. This is the duration that will be reduced using the overlapping
technique. Duration d1 is the time spent using only m1 in the computation
(m1->square()), while d2 is the time spent in the part requiring both matrices
(i.e. m2->plus(m1)). Both computations depend on the matrix size (for
simplicity, both matrices are of the same size). Moreover, in order to
experiment with longer computations, and thus with situations where there is
more opportunity for overlapping to occur, the duration d1 can vary: a
parameter, say p > 1, is given to the test, and the computation inside
m1->square() is called p times.
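
For readers who want a concrete picture of the two parts of the service, the
computation can be sketched sequentially in standard C++ as follows. This is
not the benchmark code: the Matrix struct and the functions square, add and
rang are hypothetical stand-ins, add plays the role of m2->plus(m1), and the
sketch assumes that the p repetitions recompute the same product, so that p
lengthens d1 without changing the result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, minimal square matrix of integers (row-major storage).
struct Matrix {
    std::size_t n;
    std::vector<int> cells;
    Matrix(std::size_t size, int fill) : n(size), cells(size * size, fill) {}
    int& at(std::size_t r, std::size_t c) { return cells[r * n + c]; }
    int at(std::size_t r, std::size_t c) const { return cells[r * n + c]; }
};

// First part of the service (duration d1): uses m1 only. The product is
// recomputed p times from the same input (an assumption of this sketch).
Matrix square(const Matrix& m1, int p) {
    assert(p >= 1);
    Matrix result(m1.n, 0);
    for (int rep = 0; rep < p; ++rep) {
        for (std::size_t r = 0; r < m1.n; ++r) {
            for (std::size_t c = 0; c < m1.n; ++c) {
                int sum = 0;
                for (std::size_t k = 0; k < m1.n; ++k) {
                    sum += m1.at(r, k) * m1.at(k, c);
                }
                result.at(r, c) = sum;
            }
        }
    }
    return result;
}

// Second part of the service (duration d2): needs both matrices.
Matrix add(const Matrix& a, const Matrix& b) {
    assert(a.n == b.n);
    Matrix result(a.n, 0);
    for (std::size_t i = 0; i < a.cells.size(); ++i) {
        result.cells[i] = a.cells[i] + b.cells[i];
    }
    return result;
}

// The whole service: square m1 (p times), then add m2.
Matrix rang(const Matrix& m1, const Matrix& m2, int p) {
    return add(square(m1, p), m2);
}
```

Only the call to add needs m2, which is why m2 is the natural candidate for a
later parameter in the test.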
[Figure 1: timeline of the caller side and the callee side, showing the total
duration, the transfer duration before being able to start the service, d1,
and d2]

Fig. 1. Temporal decomposition of a (blocking) remote service call




Fig. 2. Execution of the remote service (caller side, total_duration in µs)



4.2 Results Using C++//

Apart from proving the correctness of the overlapping technique implementation,
we will show that the obtained results are scalable and can yield optimal
gains. The formal definition of what we mean by gain will be given at the end
of this subsection. We begin with some LAN-based tests (using two Sun
Solaris 2.6 workstations with 128 MB of RAM, interconnected by a 10 Mbit/s
Ethernet), followed by some Globus-based ones.

The two curves plotted in Figs. 2 and 3 show that when the duration of the
remote computation that does not access later parameters (d1) increases, the
benefit also increases. Indeed, because of the use of lightweight processes,
computation using only m1 and reception of m2 have more opportunity to be
interleaved when d1 increases.

Moreover, the reception-related operations do not disturb the on-going
computation very much (see Fig. 4), although they arise while m1->square() is
being executed. To claim this, we must be sure that the reception-related
operations indeed arise while m1->square() is being executed (and not later, when
[Plot; x-axis: matrix size (total number of integers)]

Fig. 3. Execution of the remote service (caller side, total_duration in µs).
m1->square() (d1) is 4 times longer than in Fig. 2




Fig. 4. Duration in µs of the remote computation not using later parameters (d1).
The tests correspond to those of Fig. 3



m2->plus(m1) has already started). The answer is given by Fig. 5, where one can
observe that almost no reception-related operations have arisen in m2->plus(m1).

The overlapping technique, used in this context where lightweight processes
are available, scales very well, as Fig. 6 shows. As distributed computing on
grid environments is mainly justified by huge data sets, this is an
interesting property. Moreover, comparing with our past experience, we deduce
that only runtime supports using lightweight processes can scale so well.
Indeed, benchmarks conducted in the context of C++// on top of PVM [1] proved
that the amount of data that could be sent and received while the remote
service is in progress is bounded by the remote receiving buffer size. The
fundamental reason is that the transport-level layer cannot gain the attention
of the receiver process while the latter is engaged in a remote computation
(i.e. m1->square()), due to the lack of a dedicated concurrent receiving
thread.

We have also tested the use of a multi-processor workstation for the remote
service execution. All experiments we have conducted on this platform occurred
while it was unloaded, so that we could assume that at least 2 CPUs were idle.
In this case, the computation not using later parameters (i.e. m1->square())
Fig. 5. Duration in µs of the remote computation using later parameters (d2). The
tests correspond to those of Fig. 3




Fig. 6. Execution of the remote service with large matrix sizes (caller side,
total_duration in µs). Same configuration as tests of Fig. 3



is absolutely not disturbed compared with experiments where the overlapping
technique is not active. This confirms that the reception of later parameters
is effectively executed in parallel with the computations not using later
parameters. This gives us confidence that the way the technique is implemented,
i.e. based on lightweight processes, provides truly concurrent activities that
can even be executed in parallel, in this case yielding an unmeasurable
overhead.

Gain. Let us define a gain (G) in order to give a concrete estimation of the
benefit.

$$G = \frac{\mathit{duration}_{\mathit{not\_using\_overlap}} - \mathit{duration}_{\mathit{using\_overlap}}}{\mathit{later\_parameters\_transfer\_duration}}$$

Here duration_(not_)using_overlap represents the total duration of the remote
service execution in either case (not using or using the overlapping
technique). The duration for transferring later parameters, i.e. m2, is
estimated by sending a C++// object of the same size, not counting the small
additional cost that would be required for managing a later parameter (a few
milliseconds).
Fig. 7. Benefit (G) obtained from using the overlapping technique on a LAN.
Corresponds to the tests of Fig. 3



Expected values of G are in [0, 1[: this means that the transfer duration of
later parameters has been overlapped by some useful computation occurring at
the callee side (i.e. m1->square()). To avoid negative values for G, the only
condition is that later parameters and computation duration be sufficiently
large to mask the small overhead of the technique (see Fig. 7). Obtaining a
value of G greater than 1 is not related to the overlapping technique but to
variable network loads (especially noticeable on a WAN, see Fig. 9 and [1]).
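
The gain can be computed directly from three measured durations, as in the
small helper below (the function name gain and its parameter names are
hypothetical; all durations must be expressed in the same unit).

```cpp
#include <cassert>

// G = (duration not using overlap - duration using overlap)
//     / transfer duration of the later parameters.
double gain(double duration_not_using_overlap,
            double duration_using_overlap,
            double later_parameters_transfer_duration) {
    assert(later_parameters_transfer_duration > 0.0);
    return (duration_not_using_overlap - duration_using_overlap) /
           later_parameters_transfer_duration;
}
```

With purely illustrative values of 1000, 700 and 400 time units, the call
gain(1000.0, 700.0, 400.0) yields 0.75, i.e. three quarters of the transfer
of the later parameters was hidden behind computation. A run in which the
overlapping version is slower than the plain one gives a negative value.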



4.3 Discussion 

Using an environment where computation and reception execute in parallel or
pseudo-parallel makes it possible to really take advantage of our technique,
thus leading to a gain close to the optimal possible value, as computed by G
and shown in Figs. 7 and 8.

However, the duration of the remote computation is of course another crucial
point. Indeed, if it is too short compared with the transmission time, almost
no communication overlapping occurs. This is why the grid-based experiments
plotted in Figs. 8 and 9 set d1 to be 300 times higher than in the experiments
plotted in Fig. 3. Even though the matrix size was 4 times smaller, this
arbitrary choice of such a high value for d1 led to a sufficiently high remote
computation duration, in the same order of magnitude as communication delays.
It is reasonable to expect that transmitting a large or even huge volume of
data to remote computers (especially on a grid) is justified by the need to
execute quite costly computations on these data.

Another important factor is related to the transmission delays. If they are
very low, because either the number of transmitted bytes is small or the
network speed is really good, as on a LAN, then the technique can yield a gain
that may in fact prove negligible (for instance, if we save only a few
milliseconds). If we now take into account the overhead of the technique (only
a few milliseconds of computation time), then we can see that the benefit
(even if optimal, i.e. if all the transmission has been overlapped) can
sometimes be outweighed by the overhead. This can indeed arise on LAN-based
environments, as Fig. 7 shows for small matrix sizes (observe the negative
values for G). On the contrary, on
Fig. 8. Execution of the remote service (caller side, total_duration in µs) and
corresponding gain. This corresponds to one Globus-based test between Argonne
and INRIA during night period, with (d1) 300 times longer than in Fig. 3



WAN-based environments, saving the transmission time of even a few bytes^
yields a positive gain that the overhead of the technique cannot outweigh
(because transmission delays are so high): observe for instance in Fig. 8 that
G is greater than 0.

We thus advocate turning the overlapping technique on for every remote
service invocation whose related communications occur on a WAN. Depending
on the remote computation algorithm and its use of parameters (which
determines how best to split the transmission of parameters by casting them to
later types), the benefit can in some cases even rise close to the optimal
possible value, i.e. where the whole transmission time of later parameters has
been saved.



5 Conclusion 

In this paper, we have defined and implemented a mechanism to overlap com- 
putations with communications in distributed object-oriented languages. 



^ More precisely, the total duration for the test in Fig. 8 using matrices m1
and m2 of 2500 integers decreases from 680816 microseconds not using the
overlapping technique to 556446 microseconds when using it.
[Plot; x-axis: matrix size (total number of integers)]

Fig. 9. Execution of the remote service (caller side, total_duration in µs).
This corresponds to one Globus-based test between Argonne and INRIA during day
period, with (d1) 300 times longer than in Fig. 3



Performances. This mechanism is interesting for environments based on
lightweight processes, because they make it possible to transfer later objects
in parallel with the on-going remote service execution. The technique scales
very well, and its use dramatically decreases the total duration of the
service execution as soon as operations on non-later parameters take enough
time to allow the transmission of later parameters to proceed in parallel. In
that case, this becomes a clear advantage for applications running on
high-latency WANs (see Figs. 8 and 9), where several seconds of transmission
time can be saved. Nevertheless, there is a small overhead when accessing
objects of later type because the access is reified. An
experiments-and-measurements analysis tool could help programmers decide when
to turn the overlapping mechanism on or off. Such a tool could extract the
same kind of numerical results as described in Sect. 4 (e.g. extract d1, d2,
..., dn and total_duration from the experiments with or without the
overlapping mechanism, and compute the related value of G).

Ease of use. As exemplified in Code 2, the programmer has to manually split
data into smaller units, but this only requires changing the type of the
parameters (making them inherit from the later class). To take advantage of
the mechanism, the remote computation does not necessarily need a specific
design (or redesign). The only important point is that the order in which the
various parameters are first used should closely follow the order in which
they are sent and received. So, the position of later parameters in method
signatures becomes important. This ease of use is an argument in favour of a
systematic usage of the technique, even if the benefits are not always
present, as they may depend on unpredictable communication durations,
especially on the grid.

Implementation. The main requirement for implementing the overlapping
technique in an object-oriented distributed language is to have free access to
the transport layer and a MOP for the language. If so, essentially only the
flatten and rebuild phases of remote procedure calls need to be modified: the
object representing the remote call has to be fragmented into several pieces.
Those phases need only to use a mechanism offering future semantics.
Transparent reception and management of later fragments is required at the
runtime support level. Such a mechanism is in widespread use, and is in
particular available in Nexus and in PM2 [14], both of them acting as
“low-level” runtime supports for parallel and distributed computations.
Abstract. Coordination systems based on multiple tuple spaces, namely
those inspired by and extending the Linda coordination language and aimed
at designing and implementing open distributed systems, are experiencing
some popularity thanks to the flexibility and dynamism of the Java
language: examples are Sun’s JavaSpaces and IBM’s TSpaces.

By integrating coordination and mobility, a flexible technology supported 
by the “run-every where” feature of Java, we developed WebCluster. Web- 
Cluster is a meta-application development system: namely a Web-based 
system that enables the implementation of Web-accessible, agent ori- 
ented, distributed applications. The application target of WebCluster is 
the class of computationally intensive applications based on the master- 
worker architecture. 



1 Introduction 

The integration of existing and new technologies often leads to the development
of new application classes. The most important examples of this concept are the
current generation of Web-based, agent-oriented systems. In these systems the
Web technology is integrated with existing technologies to improve the remote
accessibility of distributed applications which implement some form of agents,
namely autonomous programs which cooperate to solve some problem or to offer
some service. Examples are applications that are simple front-ends of existing
tools (like Web-based e-mail systems, discussion groups and the like) but also
applications that use the Web to enable the integration and the cooperation
among multiple components providing a uniform user interface (a notable
example of these systems is the SourceForge project [1], which integrates a set of
cooperative programming tools like CVS, bug tracking systems, mailing lists and
so on).

In most of these systems, however, the relationships among the various
components are handled using ad hoc programming techniques, usually based on
scripting languages. It therefore seems natural to use some coordination
technique to ease the development of this kind of software. One definition
of coordination is, in fact, the glue that enables components to cooperate and
interact [5]. Starting from the above observation we designed PageSpace [4], a
reference architecture to support distributed applications on the WWW based
on coordination technologies. The work we introduce in this paper is our first
attempt to design a meta-application development system based on the PageSpace
concepts. We describe a Web-based environment that enables users to
build their own distributed, Web-accessible applications by using a coordination
technology to enable component integration. In this very first system we decided
to keep things as simple as possible, and in fact WebCluster, the application
we present, is targeted at the development of rather simple applications based on
a master-worker architecture in a LAN environment. WebCluster itself is a
Web-accessible system that integrates the Web with coordination and code mobility
[2].

This paper is structured as follows: Sect. 2 introduces Jada, the coordination 
system on which WebCluster is based; Sect. 3 describes the architecture of We- 
bCluster; in Sect. 4 an example application based on WebCluster is presented; 
Sect. 5 concludes the paper. 

2 Jada 

Jada [3] is a coordination language for Java that can be used to coordinate par- 
allel/distributed components inspired by Linda [5]. Jada extends Linda’s basic 
concepts by implementing new primitives, replacing tuple spaces with object 
spaces (i.e. specialized object containers) and enabling the creation of multiple 
spaces [6]. 

Jada’s basic coordination entity is the Space. Concurrent threads can access
a space by using a small yet effective set of primitives that are made available
as methods of the Space class. The out, read and in primitives are used to post
an object into a space, to associatively get a copy of an object from a space, or
to associatively remove an object from a space, respectively. Associative access is
performed by passing to the input primitives a template, an object that has to
match (using a defined matching mechanism) the returned object.
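
The three primitives can be illustrated with a minimal in-memory sketch. It is not Jada's implementation or API: the class name TinySpace and its method signatures are assumptions made for this example, templates are restricted to Class objects, read returns a reference rather than a copy, and the input operations simply return null when nothing matches instead of producing a Result.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified object space; not the Jada Space class.
class TinySpace {
    private final List<Object> objects = new ArrayList<>();

    // out: post an object into the space.
    public synchronized void out(Object o) {
        objects.add(o);
    }

    // read: return a matching object without removing it, or null if none matches.
    // (A real space would hand back a copy; this sketch returns the reference.)
    public synchronized Object read(Class<?> template) {
        for (Object o : objects) {
            if (template.isInstance(o)) {
                return o;
            }
        }
        return null;
    }

    // in: remove and return a matching object, or null if none matches.
    public synchronized Object in(Class<?> template) {
        Iterator<Object> it = objects.iterator();
        while (it.hasNext()) {
            Object o = it.next();
            if (template.isInstance(o)) {
                it.remove();
                return o;
            }
        }
        return null;
    }

    // Number of objects currently stored in the space.
    public synchronized int size() {
        return objects.size();
    }
}
```

In this sketch read leaves the space unchanged while in shrinks it by one object, which is the only behavioural difference between the two input primitives.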

Jada also provides the users with the readAll primitive that returns all the
objects that match a given template and with the getAll and getAny primitives
that return, respectively, all the objects that match or any object that matches
a set of templates. All the input primitives can be associated with a timeout,
interpreted as the time within which the primitive has to be performed. Input
primitives are never blocking: they return an object that is an instance of the
Result class. This object provides users with methods to check whether the
operation has been successfully performed, whether it has been canceled (either
by the user or because the timeout has expired), and to gather its result.
Gathering the result is a blocking operation: if the result is not yet available,
the calling thread blocks until either the operation is successfully performed,
or it is canceled. Output primitives can specify an associated time-to-live: when
this time is over, the object emitted in the object space can be reclaimed by the
garbage collector (the actual time-to-live used is the minimum of the one
requested by the agent and the space default).
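
The future-like behaviour of such a result handle can be sketched with the standard CompletableFuture class. The class PendingResult and all of its method names are hypothetical and are not taken from Jada's Result class; the sketch only mirrors the behaviour described above (non-blocking status checks, a blocking get, and a time-to-live taken as the smaller of two values).

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for a Result-like handle; not Jada's Result class.
class PendingResult<T> {
    private final CompletableFuture<T> future = new CompletableFuture<>();

    // Called by the space when a matching object becomes available.
    void complete(T value) {
        future.complete(value);
    }

    // Called by the user, or by the space when the timeout expires.
    void cancel() {
        future.cancel(false);
    }

    // Non-blocking status checks.
    boolean isAvailable() {
        return future.isDone() && !future.isCancelled();
    }

    boolean isCancelled() {
        return future.isCancelled();
    }

    // Blocks until the operation completes; returns null if it was cancelled.
    T get() {
        try {
            return future.join();
        } catch (CancellationException e) {
            return null;
        }
    }

    // Effective time-to-live: the smaller of the requested and the space default.
    static long effectiveTtl(long requestedMillis, long spaceDefaultMillis) {
        return Math.min(requestedMillis, spaceDefaultMillis);
    }
}
```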

The matching policy used by Jada is very simple and easily extensible. 
Templates (formals) are represented by instances of the Class class, the Java 
meta-class. A template representing an Integer class, for instance, matches any 
Integer object. Actual to actual matching is delegated to the standard Java 
equals method in the general case, and to the ad hoc matches method when 
the objects implement the JadaObject interface. This mechanism is used in par- 
ticular to customize the matching in the Tuple class, which is an ordered object 
container used to mimic Linda tuples. This class defines its matching policy by 
implementing the matches method so that two Tuple objects a and b match 
when: 

— a and b have the same number of fields; 

— each field in a matches the corresponding field in b using the standard Jada
matching mechanism.

The same mechanism can be applied to any user-supplied class. 
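
The matching rules can be written down as a short sketch. The names Matchable, Matching and SimpleTuple are invented for this illustration and stand in for the roles that the text assigns to the JadaObject interface, the standard matching mechanism and the Tuple class; the actual Jada signatures may differ.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical names throughout; a sketch of the matching rules, not Jada's code.
interface Matchable {
    boolean matches(Object other);
}

final class Matching {
    private Matching() {}

    // A formal (a Class object) matches any instance of that class.
    // Actuals use matches() when available and equals() otherwise.
    static boolean match(Object template, Object candidate) {
        if (template instanceof Class) {
            return ((Class<?>) template).isInstance(candidate);
        }
        if (template instanceof Matchable) {
            return ((Matchable) template).matches(candidate);
        }
        return template.equals(candidate);
    }
}

final class SimpleTuple implements Matchable {
    private final List<Object> fields;

    SimpleTuple(Object... fields) {
        this.fields = Arrays.asList(fields);
    }

    // Two tuples match when they have the same number of fields and every
    // field matches its counterpart under the general rule above.
    @Override
    public boolean matches(Object other) {
        if (!(other instanceof SimpleTuple)) {
            return false;
        }
        SimpleTuple t = (SimpleTuple) other;
        if (fields.size() != t.fields.size()) {
            return false;
        }
        for (int i = 0; i < fields.size(); i++) {
            if (!Matching.match(fields.get(i), t.fields.get(i))) {
                return false;
            }
        }
        return true;
    }
}
```

With these definitions, a template tuple built from "index" and Integer.class matches a tuple built from "index" and 0, but not one whose second field is a string or one with an extra field.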

Jada provides users with a client/server based technology that enables
distributed components to access an object space uniformly. Moreover, since an
object space is a Java object, any application can create several object spaces
and even several server object spaces. The same paradigm can then be used to
achieve data-driven coordination in both parallel and distributed applications,
though access to a remote object space can obviously fail because of network
problems.

Security in Jada is addressed at two levels: by enforcing access control policies 
on a per-space basis, and by supporting data encryption when accessing a remote 
space. While the second mechanism obviously applies to remote spaces only, the 
first can also be used when concurrent threads access a local, shared object space. 
One of the advantages of this approach is that adopting a space-based access 
control enables uniform security policies to be used for both the concurrent and 
the distributed case, which is particularly useful for mobile agents. 

In the last few years Jada has been used for several research projects and
to implement quite different systems, from parallel computing to Internet card
games, from distributed collaborative applications to mobile agent systems.

3 Architecture of WebCluster 

WebCluster is a Web-accessible distributed computation system that allows
remote users to upload Java-based worker agents into the run-time environment.
Remote applications, or applets, can submit jobs for the worker agents and
gather computation results by using Jada object spaces.

Distributed applications on WebCluster are called projects. The components 
of a typical project are: 

— an object space used to post jobs and results; 
— a set of worker agents (run by the WebCluster run-time system on every
available host in the remote local area network);

— a master agent, usually an applet that can be downloaded from the Web- 
Cluster HTTP server. 

The components are uploaded by the users to WebCluster by means of a
Web-based interface that allows the remote administration of the projects.
Administration options include:

— the creation of a new project; 

— the uploading of updated code for an existing project; 

— the activation of an existing project; 

— the suspension of an existing project. 

When a new project is created, the Java code for the worker agents and for the
master applet is uploaded into WebCluster (along with an HTML document
that is used as the project’s homepage and that usually includes a reference to
the master applet in order to enable access to the project from the Web). When
this operation is accomplished the application can be activated.

When a project is activated, WebCluster’s Coordinator component creates
a new Jada object space that can be used by the project agents to post
jobs and results. The Coordinator also uses WebCluster’s main object space to
notify available workstations that new worker agents have to be activated.
WebCluster is designed so that new workstations can join the system at any time.
At this point a user can download the project’s homepage to execute the master
applet that will post computation jobs into WebCluster. The architecture of
WebCluster is shown in Fig. 1. Dashed lines indicate that a set of components
is contained inside the same physical host. Jada spaces are not shown
as contained, to indicate that they can be created by the Coordinator on any
available host.

4 The Mandelbrot Set Explorer Application 

The Mandelbrot Set Explorer is an application written to be run by WebCluster.
The idea is to decompose the whole area to be calculated into smaller sub-areas,
creating a set of jobs, and to delegate the computation of the jobs to worker agents.
In the Frequently Asked Questions of the comp.parallel Usenet group, the
computation of the Mandelbrot set is defined as “embarrassingly parallel” and
this is one of the reasons for choosing it as a test bed for WebCluster. It is not the
aim of this paper, in fact, to show how to write an application for WebCluster
that exploits some “smart” parallelization technique but, rather, to show how
to use WebCluster for a “bare-bones” parallel algorithm.

The Mandelbrot Set Explorer application has two components: the
master and the worker(s). The master itself is composed of two sub-components:
the engine (the component that splits the whole computation into jobs, posts the
jobs into the remote space and gathers the results) and the user interface (that
reads the user input and displays the calculated set on the screen).

Fig. 1. WebCluster architecture



As with every WebCluster project, when the Mandelbrot Set Explorer is
activated the workers are propagated to the hosts in the cluster. When they 
start as new computational threads, a reference to the project’s space is passed 
to them by the run-time system. At this point the workers start an endless loop 
in which they get a new job from the space (a blocking operation), compute the 
job and put the result back into the space. 
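
The worker loop can be sketched as follows. This is an illustration only: two BlockingQueue objects stand in for the project's Jada space, the Job and JobResult descriptors and their fields are hypothetical, and a job is reduced to a single point of the complex plane rather than a sub-area.

```java
import java.util.concurrent.BlockingQueue;

// Hypothetical job descriptor; label identifies the master that generated the job.
final class Job {
    final int label;
    final double re;
    final double im;
    final int maxIter;

    Job(int label, double re, double im, int maxIter) {
        this.label = label;
        this.re = re;
        this.im = im;
        this.maxIter = maxIter;
    }
}

// Hypothetical result descriptor carrying the label of the originating job.
final class JobResult {
    final int label;
    final int iterations;

    JobResult(int label, int iterations) {
        this.label = label;
        this.iterations = iterations;
    }
}

final class Worker implements Runnable {
    private final BlockingQueue<Job> jobs;          // stand-in for the project's space
    private final BlockingQueue<JobResult> results; // stand-in for the project's space

    Worker(BlockingQueue<Job> jobs, BlockingQueue<JobResult> results) {
        this.jobs = jobs;
        this.results = results;
    }

    // Escape-time iteration count for the point c = re + i*im.
    static int iterations(double re, double im, int maxIter) {
        double x = 0.0;
        double y = 0.0;
        int n = 0;
        while (n < maxIter && x * x + y * y <= 4.0) {
            double t = x * x - y * y + re;
            y = 2.0 * x * y + im;
            x = t;
            n++;
        }
        return n;
    }

    @Override
    public void run() {
        try {
            while (true) {
                Job job = jobs.take(); // blocks until a job is available
                int n = iterations(job.re, job.im, job.maxIter);
                results.put(new JobResult(job.label, n));
            }
        } catch (InterruptedException e) {
            // Leave the loop when the run-time system interrupts the worker.
            Thread.currentThread().interrupt();
        }
    }
}
```

The loop has no termination condition of its own; in this sketch it ends only when the thread is interrupted.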

Since jobs in the space can come from different concurrent masters (multiple 
users can, in fact, access the project at the same time) a field in the job descriptor 
is used to label the generator of the job; a corresponding field is used in the
result descriptor so that results can be retrieved only by the generator of the
corresponding jobs.

In order to enable masters to label their jobs with non-conflicting labels,
the Coordinator puts the ("index", 0) initial tuple into newly created spaces.
Masters in the Mandelbrot Set Explorer project use this tuple to generate a 
sequence of unique labels. 
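
The text does not spell out how masters use the tuple. One plausible scheme, assumed here, is that a master removes the ("index", n) tuple, keeps n as its label and puts ("index", n + 1) back; since removal is atomic, two masters can never obtain the same n. The sketch below uses a BlockingQueue holding a single integer as a stand-in for the space, and the class name LabelSource is hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical helper; the queue plays the role of the space that holds
// the single ("index", n) tuple.
final class LabelSource {
    private final BlockingQueue<Integer> indexTuple = new LinkedBlockingQueue<>();

    LabelSource() {
        indexTuple.add(0); // the Coordinator's initial ("index", 0)
    }

    int nextLabel() {
        try {
            int n = indexTuple.take(); // "in": while the tuple is removed, other masters wait
            indexTuple.put(n + 1);     // "out": publish the next free index
            return n;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the index tuple", e);
        }
    }
}
```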

When a remote master is stopped without waiting for all the results to be
gathered, unclaimed result objects could pollute the project’s spaces. In order
to overcome this problem the default time-to-live for the objects in the project
spaces is set to one hour, so that results that stay unclaimed for longer
can be reclaimed by the garbage collector.

Figure 2 shows, on the left, the WebCluster interface for creating a new
project and, on the right, the Mandelbrot Set Explorer application running.

Fig. 2. WebCluster: creating a new project and the Mandelbrot Set Explorer

5 Conclusions

WebCluster is a Web-accessible distributed computation system that exploits
coordination and mobile code technologies. While it has been implemented mostly
as a proof of concept for the design of a meta-application development system,
it is actually a running, usable system. Still, WebCluster lacks some valuable
features, like the ability to build three-tier applications by moving the master into
WebCluster and leaving just the input/output component in the browser,
something that should be possible with a more flexible system. We are currently
investigating how to implement this kind of feature while keeping the system
as simple as possible. We are also engaged in the design of more generic
meta-application development systems for deployment of distributed, collaborative
applications accessible from the Web.
Abstract. Debuggers are used to control the state of many processes,
to present distributed information in a concise and clear way, to observe 
the execution behavior, to detect and to locate programming errors. In 
this paper we briefly describe the design of SPiDER, which is an interactive
source-level debugging system for both regular and irregular
High Performance Fortran programs. SPiDER allows the programmer to inspect a single
process of a parallel program or to examine the entire program from
a global point of view. A sophisticated visualization system has been
developed and included in SPiDER to visualize data distributions, data- 
to-processor mapping relationships, and array values. SPiDER enables 
a programmer to dynamically change data distributions as well as array 
values. For arrays whose distribution can change during program exe- 
cution, an animated replay displays the distribution sequence together 
with the associated source code location. Array values can be stored at 
individual execution points and compared against each other to examine 
execution behavior (e.g. convergence behavior of a numerical algorithm). 
SPiDER has been fully implemented and is currently being used for the 
development of various real-world applications. Several experiments will 
be presented that demonstrate the usefulness of SPiDER. 



1 Introduction 

In recent years, parallel processing has evolved into a more widespread technology for
delivering parallel computing capability across a range of parallel architectures.
Unfortunately, availability of parallel systems does not imply ease of use. Hence,
there has been an increased emphasis on parallel programming environments,
including parallel language systems and tools for performance analysis, debugging,
and visualization.

* This research is partially supported by the Austrian Science Fund as part of the Aurora
Project under contract SFBF1104.

In this paper we briefly describe the design of SPiDER, which is an interactive
source-level debugging system for High Performance Fortran programs and leverages
the HPF language, compiler, and runtime system to address the problem of
providing high-level access to distributed data. SPiDER has been developed as
part of the long-term AURORA project [1] where several real-world applications 
have been parallelized based on HPF. For the development of SPiDER in the 
context of AURORA we had several objectives in mind: 

— Support programmers to observe and understand the execution behavior of 
their programs. 

— Detect and locate programming errors at the high-level HPF code instead 
of low-level message passing program. 

— Provide support for debugging an entire program from a global point of view 
instead of debugging individual processes. 

— Enable sophisticated data distribution steering and animation as well as 
visualization and comparison of array values. 

— Provide support to examine the quality of data distribution strategies. 

— Develop debugging technology that is capable of handling both regular and 
irregular parallel programs. 

The development of SPiDER is a joint effort among several research groups
in Austria, Germany and Poland. SPiDER integrates a base debugging system
(Technical University of Munich [17] and AGH Cracow [4]) for message passing
programs with a high-level debugger (University of Vienna) that interfaces with
VFC (University of Vienna [2]), a Fortran90/HPF compiler. The visualization
system of SPiDER, which is crucial to achieve the design objectives mentioned
above, consists of two subsystems. Firstly, a graphical user interface displays
the source code and allows the programmer to control execution, to inspect and
to modify the program state. Secondly, GDDT (University of Linz [14]) is a
sophisticated system to visualize data distributions and array values, to animate
array distribution sequences and to display how well data has been distributed
across all processors.

In the next section we give an overview of the VFC compiler and the most
important HPF language constructs needed to describe some of the functionality
of SPiDER and to understand our experiments. In Sect. 3 we
describe SPiDER as a multi-layer system comprising VFC, the HPF dependent
debugging system, the base debugging system, the data distribution steering
facility, and a graphical user interface. Experiments to demonstrate the
usefulness of SPiDER are described in Sect. 4. Related work is discussed in
Sect. 5 and concluding remarks are given in Sect. 6.
2 VFC Compiler and High Performance Fortran

The Vienna High Performance Compiler (VFC - [2]) is a command-line source- 
to-source parallelization system that translates Fortran90/HPF+ programs 
into Fortran90/MPI message-passing SPMD (single-program-multiple-data) 
programs. The SPMD model implies that each processor is executing the same 
program based on a different data domain. 

The input language to VFC is Fortran90/HPF+, where HPF+ [2] is an improved
variant of the HPF (High Performance Fortran) language. HPF consists of
a set of language extensions for Fortran to ease data parallel programming.
The main concept of HPF relies on data distribution. A programmer writes a 
sequential program and specifies how the data space of a program should be dis- 
tributed by adding data distribution directives to the declarations of arrays. It 
is then the responsibility of the compiler to translate a program containing such 
directives into an efficient parallel SPMD target program using explicit message 
passing on distributed memory machines. 

The core element of HPF is the specification of data distribution, which 
is expressed by the DISTRIBUTE directive. HPF supports a two-level mapping 
model where arrays are first aligned to a template and then the template
is distributed onto a processor array. Processor arrays are declared by using 
the PROCESSORS directive. For every array dimension the distribution is speci- 
fied separately. HPF+ extends the standard HPF set of distribution methods 
(replicated, block, cyclic, block-cyclic) with the generalized block and indirect 
distributions which allow for more flexible distribution methods especially useful 
for irregular problems. 
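
To make the distribution methods concrete, the following sketch computes which processor owns element i of a one-dimensional array under four of them. It is an illustration under stated assumptions, not VFC code: indices and processor numbers are 0-based, the block size for the block distribution is taken as the ceiling of n/p, and the class and method names are invented for this example.

```java
// Illustrative owner computations for a one-dimensional array of n elements
// distributed over p processors (0-based indices; names are hypothetical).
final class Distributions {
    private Distributions() {}

    // Block: contiguous chunks of ceil(n / p) elements per processor.
    static int blockOwner(int i, int n, int p) {
        int blockSize = (n + p - 1) / p;
        return i / blockSize;
    }

    // Cyclic: elements are dealt out round-robin.
    static int cyclicOwner(int i, int p) {
        return i % p;
    }

    // Block-cyclic: blocks of k elements are dealt out round-robin.
    static int blockCyclicOwner(int i, int k, int p) {
        return (i / k) % p;
    }

    // Generalized block: processor q owns sizes[q] consecutive elements.
    static int generalBlockOwner(int i, int[] sizes) {
        int upper = 0;
        for (int q = 0; q < sizes.length; q++) {
            upper += sizes[q];
            if (i < upper) {
                return q;
            }
        }
        throw new IndexOutOfBoundsException("index " + i + " lies outside the array");
    }
}
```

An indirect distribution would replace these formulas with a lookup in a user-supplied mapping array, which is what makes it suitable for irregular problems.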



3 SPiDER 

SPiDER [4], an advanced symbolic debugging system for Fortran90/HPF parallel
programs, makes it possible to control and monitor program processes at the
source code level. Multiple process views of the program enable a programmer to
examine a single process of a parallel program or to inspect the entire program
from a global point of view. SPiDER allows the programmer to examine distributed
data structures, which are visible as a single global entity, i.e., a programmer
can inspect and modify a section or individual elements of distributed arrays
without the need to specify on which processor the elements reside. Moreover,
SPiDER provides support for regular and irregular applications with several
exceptional features for visualization and steering of data distributions. Data
distribution can be dynamically changed after stopping program execution at a
breakpoint. Sophisticated visualization capabilities provide graphical
representation of array values and data distribution with convenient navigation
facilities for distributed data and logical processor arrays. SPiDER can also
store up to seven snapshots of the contents of a given array and visualize
differences between them. For complex applications in which the distribution of
arrays changes many times during program execution, SPiDER provides an animated
replay of the array redistribution sequence and allows the programmer to observe
the migration of arbitrary array elements in a stepwise or continuous mode.
Finally, SPiDER supports a load diagram that visualizes how many array elements
have been mapped to every processor. This feature enables a programmer to
examine how evenly data has been distributed across all processors.

Fig. 1. Architecture of the HPF debugging system
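
The quantity behind the load diagram mentioned in this section can be sketched in a few lines. The class LoadDiagram, its method names and the imbalance measure (largest load divided by average load) are assumptions made for this illustration; the text does not say which statistics SPiDER actually computes.

```java
// Hypothetical helper: counts how many array elements each processor owns.
final class LoadDiagram {
    private LoadDiagram() {}

    // counts[q] = number of array elements mapped to processor q.
    static int[] elementsPerProcessor(int[] ownerOfElement, int processors) {
        int[] counts = new int[processors];
        for (int owner : ownerOfElement) {
            counts[owner]++;
        }
        return counts;
    }

    // Ratio of the largest load to the average load; 1.0 means perfectly even.
    static double imbalance(int[] counts) {
        int max = 0;
        long total = 0;
        for (int c : counts) {
            max = Math.max(max, c);
            total += c;
        }
        return total == 0 ? 1.0 : max * (double) counts.length / total;
    }
}
```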

Figure 1 shows the architecture of SPiDER with an emphasis on the support
provided by VFC and low- and high-level debugging technology. The input
programs of SPiDER are compiled with VFC to Fortran90 message passing programs.
In order to generate an executable file, a vendor back-end Fortran90 compiler
is used. The two-stage compilation process is reflected in the debugger
architecture. The main parts of the system are the Base Debugging System (BDS) and
the HPF Dependent Debugging System (HDDS).

BDS operates as a low-level debugger closely related to the target machine
on which it is running. It resolves all platform-specific issues and hides them
from the HDDS level. It also constitutes a clear, simple but unequivocal interface
that provides functionality for inspecting the state of processes and
values of data in the parallel program. BDS does not check for consistency of
the running application with the HPF source code but provides information to
HDDS about every process of the program. The design of BDS partially relies
on the DETOP parallel debugger [17] and the OCM [18] monitoring system
developed at LRR-TUM. HDDS works on top of BDS and provides higher-level
functionality for viewing the associated HPF source code of the target
parallel program and for interactively controlling and altering the application
data. The interface of SPiDER to VFC is supported by a symbol table file which
includes mapping information about mutually corresponding lines and symbols in
HPF and the resulting Fortran90 message passing programs, and information about
compiler transformations.

A programmer interacts with SPiDER by using the visualization system,
which consists of a Graphical User Interface (GUI) and a Graphical Data
Distribution Tool (GDDT) for visualization of HPF data structures [14] (see Sect. 3.3).

The debugging commands offered by SPiDER can be subdivided into six 
classes: 

1. Execution control: SPiDER makes it possible to start, stop, and single step
either the entire program or a specific process at any given time (see
Figures 2 and 3).

2. Inspection of program state: These commands retrieve information
on the program’s current state, e.g. the list of all existing processes, the
current point of execution, back-trace of procedure calls, and the types and
values of variables (distributed and replicated) or expressions (see Figures
2, 3, and 4).

3. Visualization of distributed data: These commands invoke GDDT to
graphically visualize data distributions and array values (see Figures 2 and
3). Moreover, a history of data distributions and array value changes can be
displayed.

4. Modification of program state: A set of commands is provided to modify
the contents of variables and to change data distributions.

5. Events and actions (breakpoints): Breakpoints (see Figures 2 and 3)
may be set on source code lines or procedure entries for an arbitrary set of
processes. A breakpoint consists of an execution event and an associated stop
action. The event is raised whenever one of the selected processes reaches
the given position in the source code. The stop action can either stop the
process that raised the event, the processing node (on which a process is
executing) or the entire program. These modes are essential in order to
obtain consistent views of shared variables or the program’s global state.
Additionally, there are several events that are permanently monitored by
SPiDER, e.g. exceptions or termination of a process.

6. Miscellaneous: There are also commands to display a source file, to set
defaults, e.g. a default action for breakpoints, and to configure the graphical
interface.



3.1 Data Distribution Steering 

The capability to modify variable values in order to influence the behavior of a 
program is a very important feature of traditional debuggers. For long-running 
applications the programmer may inspect the program state at a given break- 
point and also control parameters that impact the program’s performance. Spec- 
ifying data distributions is of paramount importance to impact the performance 
of HPF programs. Therefore, the ability to steer the selection of data distribu- 
tions provides the programmer with an excellent capability for program tuning. 
However, changing data distributions during program execution must be done
with great care. Compilers perform various optimizations and transformations
based on the assumption of a single data distribution (or possibly a set of
them) that holds at a specific program point. If a programmer changes the data
distribution during a debugging session, such assumptions may become invalid,
which can result in incorrect program behavior. It mostly depends on the
underlying compiler whether interactive redistribution of an array at a specific
program point is valid or not. In the following we discuss important issues for
interactive array redistribution under SPiDER.

In HPF the DYNAMIC directive is used to specify that the distribution of an 
array can be changed during program execution. All other arrays are assumed to 
be statically distributed (distribution cannot be changed during execution). For 
DYNAMIC arrays, compilers may assume that the associated distribution strategy
is always unknown and, therefore, generate code that is distribution transparent
(i.e., its behavior does not depend on the distribution of arrays) and refrain
from performing any distribution-driven optimizations. However, advanced
compiler technology may determine the set of possible distributions of an array 
at a given program point. This information can enable more optimized code 
generation which usually implies a reduced runtime overhead. 

The execution of an SPMD parallel program commonly consists of interleaved
phases of independent computation and of communication. Note that independent
computation phases are not restricted to code sections associated with
the INDEPENDENT directive. The processes of a parallel program are not
synchronized during a computation phase and every process may execute a different line
of code at any given point in time. In many cases breakpoints serve as synchro- 
nization points where the debugger can provide a consistent view of the program 
execution and data (for instance, single execution point of processes and single 
value of replicated variables). However, there are exceptions based on the par- 
allel nature of some HPF constructs that break the consistency of the program. 
The most typical example is a DO INDEPENDENT loop nest where every process 
executes a unique set of loop iterations. In order to enable a genuine parallel 
execution of DO INDEPENDENT loops, all data read or written by any process has 
to be local, otherwise non-local accesses would synchronize the execution.
Invoking an array redistribution during execution of a DO INDEPENDENT loop would
change the placement of array elements and as a consequence may invalidate the
current work distribution.

VFC employs the inspector/executor strategy (see Sect. 2) to implement 
DO INDEPENDENT loops. A communication schedule specifies the non-local data 
needed to perform local computations. The loop nest is transformed in order 
to provide a uniform mechanism for accessing local and non-local data kept 
in buffers. By changing the distribution of a given array both communication 
schedule and associated buffers to access the array would be invalidated. The 
semantics of the program may be changed and incorrect results may be computed
or the program may even crash. Another danger of changing a program’s
semantics stems from the REUSE clause, which prevents redundant computation of
inspector phases. Array redistribution could invalidate communication schedules
and, therefore, also REUSE clauses. Although it is possible to recalculate the work 
distribution and resume execution of a loop nest, VFC and SPiDER currently 
disallow array redistribution during execution of a DO INDEPENDENT loop. 

VFC provides SPiDER with important information (included in the HPF 
symbol table) to decide whether redistribution is allowed or not. Currently array 
redistribution is allowed based on the following constraints: 

1. An array is associated with the DYNAMIC directive. 

2. An array is not an alignee or an align target. 

3. A breakpoint is set outside a DO INDEPENDENT loop nest.

4. Distribution driven compiler optimizations are turned off. 

VFC determines these constraints and provides them to SPiDER through 
the HPF symbol table. The conditions are evaluated by the debugger at break- 
points and, depending on the result, a programmer is permitted to change the 
distribution of an array or not. 
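
The four constraints amount to a simple conjunction that the debugger can evaluate at a breakpoint. The sketch below is an assumption about how such a check might look; the class RedistributionCheck, the ArrayInfo record and its field names are hypothetical and do not reflect the actual layout of the HPF symbol table.

```java
// Hypothetical encoding of the four redistribution constraints.
final class RedistributionCheck {
    private RedistributionCheck() {}

    // Per-array facts that the symbol table is assumed to provide.
    static final class ArrayInfo {
        final boolean dynamic;              // declared with the DYNAMIC directive
        final boolean aligneeOrAlignTarget; // takes part in an alignment

        ArrayInfo(boolean dynamic, boolean aligneeOrAlignTarget) {
            this.dynamic = dynamic;
            this.aligneeOrAlignTarget = aligneeOrAlignTarget;
        }
    }

    // True only if all four constraints hold at the current breakpoint.
    static boolean mayRedistribute(ArrayInfo array,
                                   boolean breakpointInsideIndependentLoop,
                                   boolean distributionOptimizationsOn) {
        return array.dynamic
                && !array.aligneeOrAlignTarget
                && !breakpointInsideIndependentLoop
                && !distributionOptimizationsOn;
    }
}
```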

3.2 Graphical User Interface 

SPiDER’s graphical user interface comprises a debugger window (see “SPiDER 
window” in Figure 2) which consists of several frames: task frame, source code 
frame, state frame, output frame, and command frame. The source code frame 
shows the source of the currently debugged program. A green arrow (see Fig- 
ure 2) points to the current statement where the debugger stopped all associated 
processes. Breakpoints are marked by red STOP icons. A programmer may click 
on a code line, variable name, or a breakpoint marker which pops up a menu 
offering all possible debugger commands for the given selection. For instance, if 
a line or a variable is selected, the menu will offer to set a breakpoint or to print
the type or contents of the variable. If a breakpoint marker has been selected, 
the menu will enable deletion or modification of this breakpoint. 

The task frame tabulates the list of processes (tasks) which currently
execute the program shown in the source code frame. All commands invoked by a
programmer under the debugger window will be applied to all selected processes
in the task frame. In addition, there are also global commands that impact the
entire application, e.g. a global stop.

The state frame displays the back-trace of procedure calls for all processes 
that are currently stopped. The command frame (see “SPiDER window” in Figure
3) enables the programmer to enter debugger commands. The output frame
shows various debugger output as a result of debugger commands entered by a 
programmer. In this frame SPiDER, for instance, outputs array values or data 
distributions. 

A single debugger window is often sufficient and most convenient for debug- 
ging SPMD data parallel programs where all processes execute the same code. 
However, in the case where different processes are executing different parts of a 
program at a given time, it is very useful to simultaneously visualize all source 
code frames. Among other uses, this feature may be helpful if pure procedures
are called in DO INDEPENDENT loops. Moreover, a coarse-grained view of the
entire program can be shown in one window, whereas a specific process could be
debugged in a second window.

SPiDER has been designed to allow multiple debugger windows each of which 
may be associated with an arbitrary set of processes. 

3.3 GDDT: Graphical Data Distribution Tool 

GDDT [14] is used by SPiDER to visualize distributed arrays and their
corresponding processor arrays. The development of GDDT has been largely driven
by the needs of SPiDER; however, as of today it is a tool that can be used for
other systems as well.

GDDT (Graphical Data Distribution Tool) has been designed for visualization
and manipulation of distributed data structures and comprises the following
features:

— visualization of data distributions,

— animation of redistribution histories,

— display of statistical information about data distributions, 

— visualization of array values. 

4 Applications 

SPiDER has been fully implemented with all the functionality described in this
paper. SPiDER is currently based on DETOP version 1.1, GDDT version 1.1,
and VFC version 2.0. SPiDER currently runs under Sun Solaris 7. VFC generates
message passing programs based on the MPI library mpich 1.1.2. In this section
we present two experiments that examine the usefulness of SPiDER: a system for
pricing of financial derivatives [6], developed by Prof. Dockner’s group at the
University of Vienna, and a system for quantum mechanical calculations of
solids [3], developed by Prof. Schwarz and his group at the Vienna University
of Technology.

4.1 Pricing of Financial Derivatives

The pricing of derivative products is an important field in finance theory. A
derivative (or derivative security) is a financial instrument whose value
depends on other, so-called underlying securities [12]. Examples are stock
options and variable coupon bonds, the latter paying interest rate dependent
coupons. The pricing problem can be stated as follows: what is the price today
of an instrument which will pay some cash flows in the future, depending on the
development of an underlying security, e.g. stock prices or interest rates? For
simple cases analytical formulas are available, but for a range of products
whose cash flows depend on a value of a financial variable in the past
(so-called path dependent products) Monte Carlo simulation techniques have to
be applied [6]. By utilizing massively parallel architectures very efficient
implementations can be achieved [13]. For a detailed description of the
technique implemented see [6].
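
To illustrate why path dependent products call for Monte Carlo simulation, the
following minimal sketch prices an arithmetic-average (Asian) call option. It
is not the technique of [6]: the underlying here is assumed to follow a
geometric Brownian motion rather than a Hull and White tree, and the function
name and all parameters are illustrative. Each path is simulated independently
of the others, which is what makes the method attractive for parallel
execution.

```python
import math
import random


def price_asian_call(s0, strike, rate, sigma, maturity, steps, n_paths, seed=0):
    """Monte Carlo price of an arithmetic-average call (assumed GBM underlying)."""
    rng = random.Random(seed)
    dt = maturity / steps
    drift = (rate - 0.5 * sigma ** 2) * dt
    vol = sigma * math.sqrt(dt)
    total = 0.0
    for _ in range(n_paths):              # paths are mutually independent
        s = s0
        path_sum = 0.0
        for _ in range(steps):
            s *= math.exp(drift + vol * rng.gauss(0.0, 1.0))
            path_sum += s                 # the payoff depends on the whole path
        total += max(path_sum / steps - strike, 0.0)
    # discount the average payoff back to today
    return math.exp(-rate * maturity) * total / n_paths
```

With zero volatility and zero interest rate the underlying stays at its initial
value, so the sketch returns exactly the difference between that value and the
strike, which is a convenient sanity check.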

The group of Prof. Dockner at the Department of Business Administration, 
University of Vienna, developed the pricing system [6] as an HPF application. 
VFC was used to parallelize the pricing system and SPiDER to debug and to 
control the numerical behavior of this application. 

In Figure 2 we show a snapshot of the SPiDER debugging session with the
pricing system stopped in procedure TRAVERSE_DISCOUNT at a specific path in
the Hull and White tree. The generated cash flow values are stored in array
VALUE which consists of 5904 elements.

SPiDER provides a multiple process view of the pricing system. A programmer
can either monitor and control a single process or inspect the entire program.
SPiDER displays a list of all processes that are currently executing the
program. A programmer can switch among them or select a group of processes
to which debugger commands are applied. In the source code frame the current
execution point is displayed with several breakpoints set in procedure
TRAVERSE_DISCOUNT. When execution reaches a breakpoint, either the processes
for which the breakpoint has been set or all processes of the program are
stopped. The state frame shows the current backtrace of procedure calls. It
can be seen that the main program starts procedure au, which in turn invokes
procedure TRAVERSE_DISCOUNT. Window “Processor Array” shows processor array
PR(1:4) with two processors PR(2:3) selected by a programmer. Window
“Array Values” displays the element values for VALUE(2940:2970). The value
range is between 14.19 and 89.63. Window “Data Array” shows the mapping
relationship between processors and selected array elements.

4.2 Quantum Mechanical Calculations of Solids 

A material science program package called WIEN97 [3] has been developed by
the group of Prof. Schwarz, Institute of Physical and Theoretical Chemistry,
Vienna University of Technology. WIEN97, which calculates the electronic
structure of solids, is based on density functional theory and the LAPW method,
which is one of the most accurate methods to investigate theoretically the
properties of high technology materials.




[Figure: screenshot of a SPiDER debugging session with the windows “SPiDER
window”, “Processor Array”, “Array Values”, and “Data Array”.]

Fig. 2. Inspecting the pricing system under SPiDER






[Figure: screenshot of a SPiDER debugging session with the windows “SPiDER
window”, “Processor Array” (PR(1:4)), a data array view of H(1:50,1:50), and
the view control dialogs.]

Fig. 3. Inspecting the HNS code under SPiDER
One of the most computationally intensive parts of WIEN97 comprises setting
up the matrix elements of H and S, which are complicated sums of various terms
(integrals between basis functions). A large fraction of this time is spent in
the subroutine HNS, where the contributions to H due to the nonspherical
potential are calculated. In HNS, radial and angular dependent contributions to
these elements are precomputed and condensed in a number of vectors which are
then applied in a series of rank-2 updates to the symmetric (Hermitian)
Hamilton matrix.

H, which is the main HNS array, has been distributed CYCLIC(1) in the second
dimension onto the maximum number of processors (HPF intrinsic function
NUMBER_OF_PROCESSORS) that are available on a given architecture. CYCLIC
distribution has been chosen in order to achieve good work distribution in the
presence of triangular loop iteration spaces. Figures 3 and 4 show two
snapshots of an HNS debugging session under SPiDER. The parallel HNS version
has been executed on a cluster of SUN workstations. As with the pricing system,
WIEN97 requires sophisticated support to visualize distributed arrays and the
mapping relationship of data to processors.

Window “Data Array” in Figure 3 shows the cyclic distribution of array H of the
HNS code. The individual colors of array elements indicate the owning processor
in window “Processor Array”. GDDT’s rich set of display customizations
(slicing, scaling, and rotation) enables exploration of the view of array H.
The graphical representation of distributed data is complemented by the
visualization of array values. The global view of array values (see window
“Array Values”) allows a programmer to quickly identify array elements with a
possibly inaccurate value. By using GDDT’s ability to store various array
snapshots, the position in the source code where an inaccurate value has been
assigned can be found quite quickly.
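
The effect of distributing the second dimension cyclically when the iteration
space is triangular can be checked with a small calculation. The sketch below
is an illustration under simplifying assumptions: work is taken to be
proportional to the number of lower-triangle elements in each column, and the
function names are not part of VFC, SPiDER, or GDDT.

```python
def owner_block(col, n_cols, n_procs):
    """Owner of a column under a BLOCK distribution (0-based indices)."""
    block = -(-n_cols // n_procs)   # ceiling division: columns per processor
    return col // block


def owner_cyclic(col, n_procs):
    """Owner of a column under a CYCLIC(1) distribution (0-based indices)."""
    return col % n_procs


def triangular_load(n, n_procs, owner):
    """Number of lower-triangle elements (row >= col) owned by each processor."""
    load = [0] * n_procs
    for col in range(n):
        load[owner(col)] += n - col   # column col holds rows col..n-1
    return load
```

For a 50 x 50 array on 4 processors this gives loads of 572, 403, 234 and 66
elements for BLOCK, but 338, 325, 312 and 300 for CYCLIC, i.e. the cyclic
distribution spreads the triangular work far more evenly.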

Among other things, SPiDER has been used to locate an erroneous initialization
of array H in the HNS code. All elements of H should have values in the range
between 0 and 1. Moreover, only the lower left triangle of H should have values
different from 0. Window “Array Values” of Figure 3 clearly displays array
element values above 1 and also shows that array H is not triangular. Several
array snapshots have been made, which quickly enabled the programmer to detect
that this bug was caused by an initialization procedure. Figure 4 shows the
values of array H after eliminating this bug. The upper two windows show array
elements at different iterations of a timing loop. The differences in values
can be visualized by another feature of GDDT and are shown in the lower left
window. Comparing array values at different execution points again allows the
programmer to control the numerical behavior and in particular the convergence
rate of the underlying algorithm.
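
The two properties that exposed the bug, and the comparison of snapshots, can
also be stated as simple checks on a stored array snapshot. The sketch below
represents a snapshot as a list of rows; it only illustrates the idea, and the
function names are hypothetical rather than part of GDDT.

```python
def out_of_range(snapshot, lo=0.0, hi=1.0):
    """Indices (row, col) of elements outside the closed interval [lo, hi]."""
    return [(i, j) for i, row in enumerate(snapshot)
            for j, v in enumerate(row) if not (lo <= v <= hi)]


def nonzero_above_diagonal(snapshot):
    """Indices (row, col) of non-zero elements outside the lower left triangle."""
    return [(i, j) for i, row in enumerate(snapshot)
            for j, v in enumerate(row) if j > i and v != 0]


def snapshot_difference(earlier, later):
    """Element-wise difference between two snapshots of the same shape."""
    return [[b - a for a, b in zip(r0, r1)] for r0, r1 in zip(earlier, later)]
```

A correctly initialized H yields two empty lists from the first two checks;
the element-wise difference of snapshots taken at successive iterations is the
quantity one would inspect to judge convergence.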

5 Related Work 

Despite many activities in parallel software development, debugging parallel
applications, and in particular debugging HPF applications, has not been
adequately supported so far.

There are some systems that visualize data distributions at compile time (e.g.,
HPF-Builder [16]), but they are unable to show intricate details of dynamically
changing data distributions. One of the most advanced systems in this field is
DAQV [10], which has been designed for visualization of HPF programs. It is
not a debugger by itself but rather a framework for accessing and modifying
data at run time in order to simplify visualization and computational steering.
CUMULVS (Collaborative User Migration User Library for Visualization and
Steering) [9] is a software framework that enables programmers to incorporate
fault-tolerance, interactive visualization and computational steering into
existing parallel programs. An experimental HPF debugger, Aardvark [15] from
DIGITAL, is the most advanced system that addresses many of the challenges
involved. Aardvark introduces the concept of logical entities (an abstraction
that exists within the debugger) that group together several related physical
entities and synthesize a single view or behavior from them. Support for
debugging HPF programs in most existing debuggers (e.g., PDT [5], TotalView
[7]) is based on providing a display of the source code and global data
visualization for viewing entire arrays and array segments allocated across
processors. PDT supports global data visualization and replays for race
conditions at the message passing level. TotalView provides process groups
which are treated more like sets for set-wide operations than like a synthesis
into a single logical entity. As a result no unified view of the call stacks
exists.

6 Conclusions and Future Work 

In this paper we have described SPiDER which is an interactive, source-level 
debugging system for both regular and irregular High Performance Fortran pro- 
grams. SPiDER combines a base debugging system for message-passing programs 
with a high-level debugger that interfaces to an HPF compiler. A sophisticated 
visualization system has been developed and integrated into SPiDER for data 
distribution steering and animation as well as visualization and comparison of 
array values. The main novel features of SPiDER are the following: 

— Besides regular applications SPiDER also supports irregular codes with 
highly dynamic behavior including indirect array accesses. 

— Arrays can be dynamically redistributed at well-selected execution points 
which are controlled by the underlying compiler and SPiDER. 

— Convenient facilities to navigate through distributed data arrays and logical 
processor arrays are provided with an emphasis on the mapping relationships 
between data and processors. 

— An automatic replay feature enables the user to browse and replay array dis- 
tribution sequences, which supports the examination of data redistributions 
during execution of a program. 

— Array snapshots can be taken to store all values of an array at a specific 
execution point. Sophisticated visualization technology has been developed 
to examine and compare array snapshots which, for instance, enables the 
observation of the numerical behavior (e.g. convergence rate) of applications. 

— The quality of data distributions can be examined using a load diagram, 
which visualizes how many data elements have been mapped to each process 
in a parallel program. 

SPiDER combines the most useful capabilities of many existing debuggers 
and provides novel visualization and data distribution steering functionality for 
both regular and irregular data distributions. 

In future work we will enhance SPiDER by presenting a single control flow of
a program being debugged instead of a multiple process view. Moreover, we plan
to extend existing SPiDER technology to support metacomputer applications
written in JavaSymphony [8] based on OCM, which is a distributed monitoring
environment.

Abstract. This article presents first steps in creating an intelligent
translator from Petri Net notation to C++. For this purpose Petri Nets have
been adapted to programming needs and made capable of including functional
programming commands and operators. Properties and operations of Coloured
Petri Nets and Compositional Petri Nets were implemented in the adapted
version of Petri Nets, so that they can use the advantages of hierarchical
programming and data types. The procedure of translation to C++ has been
elaborated according to a classical scheme (fragmentation, optimization and
assembling) and implemented in an experimental translator from Petri Nets to
C++.

Keywords: Parallel systems, distributed systems, Petri Nets, parallel
programming, computer clusters, supercomputing.



1 Introduction 

At the current stage of computer evolution, computer clusters and
supercomputers are widely used for solving not only scientific tasks but also
business, economic and practical ones. This tendency is growing, especially
because progress in computer networks makes it possible to build clusters from
ordinary components even at home. But programming of computer clusters differs
from programming of stand-alone computers. It is a quite difficult problem
because a programmer has to decompose an algorithm into consecutive interacting
processes and because parallel programs may have specific types of unusual
errors. There are some models (shared memory, client-server), standards (MPI,
PVM, OpenMP) and tools (HPF, HPC, Norma, T-System, DVM) for simplifying
parallel programming and improving the degree of parallelism in programs. Even
so, programming of clusters remains a job for highly skilled specialists.

Petri Nets are well known as a formalism for the description and modeling of
parallel and distributed systems. They are very attractive for application in
parallel programming. Native visual representation, composition operations,
interaction interfaces, and the explicit and implicit parallelism of different
Petri Net dialects are useful for describing parallel programs and are
sometimes missing in other programming languages and tools. It is also useful
to employ the formal methods of analysis accumulated by Petri Net theory.
However, a severe problem is that the Petri Net formalism is naturally
parallel, whereas computational units are programmed with the aid of sequential
instructions. In this article we present an approach to the translation of
parallel constructs of Petri Nets into sequential instructions of a functional
language like C++. We also outline an implementation of this approach in an
experimental version of a translator.

V. Malyshkin (Ed.): PaCT 2001, LNCS 2127, pp. 226-231, 2001.
© Springer-Verlag Berlin Heidelberg 2001

2 Petri Nets Adaptation for Programming Needs 

Petri Nets and their extensions [1] are usually based on set-theoretic
notation, which is quite different from functional programming language
semantics. We were therefore looking for another notation that could join
set-theoretic terms with programming constructions, and we found that XML
suits this best.

In order to produce an appropriate notation we took the definitions of
consecutive and parallel composition operations from Compositional Petri Nets
[4]; place, arc and transition attributes from Coloured Petri Nets [5]; and
type definitions, variable declarations, commands and operators from functional
programming languages (C++).

We defined the Petri Net notation to consist of elements, which compose the
structure of a net, and a number of inscriptions corresponding to each element
of the structure. Inscriptions are divided into two types: 1) visual
inscriptions, which describe the size, position, and naming of the elements of
the structure, and 2) qualitative inscriptions, which describe the behaviour of
elements in operations. The structure can consist of either sets of places,
transitions and arcs (for a plain Petri Net), or sets of nets and operations
between them (for a hierarchical Petri Net). Plain Petri Net elements can have
qualitative inscriptions of the following types: labels, token types, tokens,
substitution and expression (for arcs), predicate (for transitions), SAP (for
places that form s-access points), and TAP (for transitions that form t-access
points). In a hierarchical Petri Net only net elements can have qualitative
inscriptions of types TAP and SAP, which represent the inner net design in
composition operations. We define token types to be C++ programming types and
user-defined programming types. Substitution, expression and transition are
specified in C++ code too. Using C++ semantics is not obligatory; the semantics
of any functional programming language could be used. However, since the aim is
to obtain C++ code from a specification in terms of Petri Nets, it is very
convenient.
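
The structure of a plain Petri Net with its qualitative inscriptions can be
pictured with a few record types. This is a minimal sketch, written in Python
for brevity rather than in the C++ or XML-like notation used by the system; all
class and field names are assumptions made for illustration, and visual
inscriptions (size, position) are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Place:
    name: str
    token_type: str = "int"                      # a C++ type name, kept as text
    tokens: list = field(default_factory=list)   # initial marking
    sap: bool = False                            # place forms part of an s-access point


@dataclass
class Transition:
    name: str
    predicate: str = "true"   # guard, kept as source text of the target language
    tap: bool = False         # transition forms part of a t-access point


@dataclass
class Arc:
    source: str
    target: str
    inscription: str = ""     # substitution (input arc) or expression (output arc)


@dataclass
class PlainNet:
    places: dict        # name -> Place
    transitions: dict   # name -> Transition
    arcs: list

    def malformed_arcs(self):
        """Arcs that do not connect a place with a transition (either direction)."""
        bad = []
        for arc in self.arcs:
            p_to_t = arc.source in self.places and arc.target in self.transitions
            t_to_p = arc.source in self.transitions and arc.target in self.places
            if not (p_to_t or t_to_p):
                bad.append(arc)
        return bad
```

A hierarchical net would, under the same assumptions, hold a set of such nets
together with the composition operations applied to their access points.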




Fig. 1. Example of Petri Net

3 Translation from Petri Nets to C++

A typical program for solving problems with large amounts of data is divided
into the following parts:

1. Gathering input data. 

2. Decomposition of data between computational units. 

3. Processing of data in computational units. 

4. Assembling of data from computational units. 

5. Presentation of answer to user. 

These parts differ considerably in the type of data used by their algorithms.
When using Petri Nets for the description of a program it is desirable to
distinguish these parts, in order not to overload the description with
redundant data and to reduce the program definition by removing repeated parts.
We therefore think of real-size programs in Petri Nets as consisting of
algorithms that deal with different aspects of data. Algorithms that process
data can be called pure algorithms; others, which manipulate data distribution
and process migration between computational units, can be called templates.
With such a scheme neither templates nor pure algorithms depend on the size of
the data, so that their analysis and translation can be done by an appropriate
translator. We have developed operations of decomposition and optimization
that prepare Petri Nets for such translation to C++.

The decomposition operation divides a Petri Net into a set of interacting
sequential subnets of maximum length. A sequential subnet is a Petri Net that
preserves the number of tokens and interacts with other subnets only through
its boundary elements, so that interaction is between more than two subnets.

The decomposition algorithm is based on two concepts. The first is the scope of
token visibility: given a token type and an initial position, we define it as
the set of places, arcs and transitions that can be reached by the token in all
possible conditions. The second is the scope of token best visibility: given a
token type and an initial position, we define it as the set of places, arcs and
transitions that the token always passes through in all possible conditions. To
decompose a net we need to:

1. Find both scopes of visibility for each token in the net.

2. Intersect the best visibility scopes of all tokens and form a decomposition
subnet for each region.

3. Extract the best visibility scopes from the Petri Net.

4. Define new tokens for places in the previous tokens' visibility scopes that
are outgoing from the best visibility scopes.

5. Repeat from the first step while there are tokens in the net.

Fig. 3. Example of program template for computational grid
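
The first of the two concepts, the scope of token visibility, can be
approximated by plain graph reachability. The sketch below is an illustration
under strong simplifying assumptions: the net is given as a successor map over
place and transition names, arcs are implied by the map, and token types and
transition predicates are ignored, so the result over-approximates the true
scope. It does not compute the best visibility scope.

```python
from collections import deque


def visibility_scope(successors, start_place):
    """Places and transitions reachable from start_place by following arcs."""
    seen = {start_place}
    queue = deque([start_place])
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

For a net in which p1 feeds t1, t1 feeds p2 and p3, and p2 feeds t2 which
returns a token to p1, the scope of a token starting in p1 contains these five
elements and nothing from unconnected parts of the net.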

The main property of the decomposition operation is that its result exists and
is unique for any given Petri Net. Hence the next operation after
decomposition, optimization, deals with standard input data.

The optimization algorithm is based on partial enumeration of all possible
assemblings of decomposition subnets. Each assembling (or even each way of
assembling) is evaluated by heuristic rules, and the best assembling is the
final result of the operation. If several assemblings receive equal estimates,
only the one that was found first is the result. We have defined three
heuristic estimations:

1. for a maximum-length process (intended for the client-server application
model);

2. for a given number of processes of approximately equal length (intended for
a specific cluster architecture);

3. for a minimum number of processes (intended for local area network
applications).

The aim of the optimization operation is to minimize the number of subnets for
execution in accordance with the heuristic rules. This operation prepares for
translation the specification of a distributed algorithm with explicit
processes and interactions between them.
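
The selection of the best assembling can be sketched as follows. This is an
illustration only, not the authors' implementation: an assembling is modelled
as a list of processes, each process being a list of subnet lengths, the
candidate assemblings are assumed to be supplied by the enumeration step, and
the three scoring functions are one possible reading of the heuristic
estimations above (lower scores are better).

```python
def score_max_length(assembling):
    """Heuristic 1: prefer an assembling containing the longest process."""
    return -max(sum(proc) for proc in assembling)


def score_balanced(assembling, n_procs):
    """Heuristic 2: exactly n_procs processes of approximately equal length."""
    if len(assembling) != n_procs:
        return float("inf")
    lengths = [sum(proc) for proc in assembling]
    return max(lengths) - min(lengths)


def score_min_processes(assembling):
    """Heuristic 3: prefer the smallest number of processes."""
    return len(assembling)


def best_assembling(candidates, score):
    """Return the best-scored candidate; on ties the first one found wins."""
    best, best_score = None, None
    for cand in candidates:
        s = score(cand)
        if best is None or s < best_score:   # strict '<' keeps the first of equals
            best, best_score = cand, s
    return best
```

For three subnets of lengths 3, 2 and 1, a call such as
`best_assembling(candidates, lambda a: score_balanced(a, 2))` would select the
grouping into one process of length 3 and one of length 2 + 1.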

The translation of the optimized program representation from Petri Nets to C++
is a routine algorithm that is of interest only from the program development
point of view.

4 Experimental Translator of Parallel Programs from Petri Nets to C++

The system consists of a Petri Net Editor with a graphical user interface,
three console applications (Decomposition, Composition and Translation), and
two libraries (Base Library and Petri Net program template).

Base Library is a set of C++ classes needed by the other components. It
implements basic data structures such as list, array, tree, set and multiset,
as well as Petri Net data structures. Together with these structures, a number
of basic operations on them are defined there.

Petri Net Editor is used for specifying and editing Petri Net programs. It
permits working with many nets at the same time, and the programs being
designed can be written to and read from a file. To set up translation
parameters the option dialog is used, in which you enter the console tools
location and properties (???). There are three translation buttons on the
toolbar: Decomposition, Composition and Translation. To translate a program
you need to press them consecutively.

The Decomposition, Composition and Translation tools implement the
decomposition, composition and translation phases respectively. All these
components work independently of each other and do not need user interaction.
They can be run both directly from the Petri Net Editor and manually from the
command prompt as stand-alone tools that perform a certain translation phase.
This is possible because our system uses files for data communication between
subsystems. Decomposition and Composition write their results to the file
stored by the Petri Net Editor, and the Translation tool creates a C++ project
in a directory specified by the user.

The Petri Net file format is similar to XML [7] and can be easily converted to
it once a standard XML specification of Petri Nets is completely defined.

An important part of the system is the Petri Net program template. It is a
stand-alone subsystem that represents a framework for Petri Net parallel
programs. It implements all Petri Net abstractions in terms of C++. We
developed special C++ classes to represent Petri Net processes, tokens, access
points, etc. All communications between Petri Net processes are handled by
access point mechanisms. To provide process interaction we have used the MPI
library.




Fig. 4. System modules

In the translation of Petri Nets to C++ we have followed these principles:

— A subnet after the composition phase is an independent process.

— A token is an independent data structure (class) describing a process state.

— An incoming arc of a transition is a separate class that prepares data for
transition firing. It substitutes the token class into the transition
predicate.

— An outgoing arc of a transition is a special class that runs an expression
and produces new tokens.

In the final phase of translation we inherit our generated program classes
from the template and, hence, get a parallel C++ program that is designed in
terms close to Petri Net parallel mechanisms.
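
The roles that these principles assign to tokens and arcs can be sketched as
follows. The sketch is written in Python for brevity and is not the generated
C++ code or the template library: all class and method names are hypothetical,
places are modelled as plain lists of tokens, and inter-process communication
through access points and MPI is left out.

```python
class Token:
    """Independent data structure describing a process state."""
    def __init__(self, value):
        self.value = value


class InputArc:
    """Prepares data for firing: binds the first token of a place to a name."""
    def __init__(self, place, name):
        self.place = place    # list of Token objects
        self.name = name

    def bind(self):
        return {self.name: self.place[0].value} if self.place else None

    def consume(self):
        return self.place.pop(0)


class OutputArc:
    """Runs an expression on the binding and produces a new token."""
    def __init__(self, place, expression):
        self.place = place
        self.expression = expression

    def produce(self, binding):
        self.place.append(Token(self.expression(binding)))


class Transition:
    def __init__(self, inputs, outputs, predicate):
        self.inputs, self.outputs, self.predicate = inputs, outputs, predicate

    def fire(self):
        """Fire once if every input place has a token and the predicate holds."""
        binding = {}
        for arc in self.inputs:
            bound = arc.bind()
            if bound is None:
                return False
            binding.update(bound)
        if not self.predicate(binding):
            return False
        for arc in self.inputs:
            arc.consume()
        for arc in self.outputs:
            arc.produce(binding)
        return True
```

A process generated from a subnet would, under these assumptions, repeatedly
try to fire its transitions until none is enabled.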

5 Conclusion 

In this paper we investigated the possibilities of the Petri Net formalism for
describing parallel programs. An experimental version of a translator from
Petri Nets to C++ showed the possibility of creating an efficient programming
environment based on Petri Nets. We can point out some directions for further
work:

a) Program development with templates described by Petri Nets for clusters 
and supercomputers. 

b) Application of existing analysis methods to program specifications.

Abstract. In this paper we propose a new separation of the processor units to
avoid inter-unit communications for instruction dispatch, memory accesses and
control flow computations. The motivation comes from the increasing importance
of interchip signalling delays. The technique consists in separating the
instruction set into types, e.g. integer, floating point and graphic, and the
die into corresponding units, each including a private pc, an instruction
cache, a prediction unit, a branch unit, a load/store unit and a data cache.
Every type is able to fully handle data and pointer computations as well as
typed address pointers. Hence the integer machine, the floating point one and
the graphical one are highly independent machines requiring no inter-machine
communications. We justify our proposal by showing that the main communication
paths can be greatly reduced in length. We show that the fetch path length can
be divided by 2, the data load path length can be decreased by 1/3, and the
interconnection paths between computation units can be greatly simplified,
serving only for conversion purposes.



1 Introduction 

Today's processors are built around 5 units: one instruction memory unit
including an instruction cache, a fetch mechanism and a prediction subunit;
three computing units (integer, floating point, graphic: the graphic unit can
be further subdivided into an integer vector unit and a floating point vector
unit); one data memory unit including load and store buffers and a data cache.
The left part of Figure 1 shows how such a processor is organized on a die. The
die is divided into three parts of equal area, one devoted to the instruction
memory unit, another devoted to the data memory unit and a third one containing
the core, i.e. the three computing units. The size of the instruction cache has
a direct impact on the processor cycle; today, a pipelined fetch unit delivers
one fetch per cycle with a two-cycle fetch latency [5,8].

Table 1 gives a CACTI [2] estimation of the cache access times for different
cache sizes and chip technologies. The times given are the data side ones in
nanoseconds, without output driver. The computation parameters are: 128 output
bits, 32 address bits, blocks of 32 bytes for caches up to 128KB and 64 bytes
for larger caches, direct mapping for caches up to 8KB, associativity 2 for
16KB caches, 4 for 64KB caches and 8 for larger caches. For each line, the
technology applied is the left one: for example, on line one, the technology is
0.35µ.

Table 1. Caches data path times

year  techno.      1KB   4KB   8KB   16KB  64KB  256KB  1MB   cpu cycle
1998  0.35/0.25µ   0.97  1.15  1.27  2.03  2.99  4.46   7.93  500/750MHz  2/1.33ns
2000  0.25/0.18µ   0.70  0.82  0.91  1.45  2.14  3.18   5.67  1/1.5GHz    1/0.66ns
2003  0.18/0.15µ   0.50  0.59  0.65  1.04  1.54  2.29   4.08  2/3GHz      0.5/0.33ns
2005  0.15/0.13µ   0.42  0.49  0.54  0.87  1.28  1.91   3.40  4/6GHz      0.25/0.16ns
2008  0.13/0.1µ    0.36  0.43  0.47  0.75  1.11  1.66   2.95  8/12GHz     0.12/0.08ns
2010  0.1/...µ     0.28  0.33  0.36  0.58  0.86  1.27   2.27  16/...GHz   0.06/...ns



If we assume a clock frequency doubling for each technology step, we can see
that in 1998, a full fetch including cache block read and instruction selection
and slotting, in a typical 16KB cache, was possible in a cycle (as in the DEC
21164 [4]). In 2000, a fetch in a typical 64KB cache in a 600MHz cpu (as in the
DEC 21264) had a two-cycle latency with a cache block read time of one cycle,
hence the necessity of pipelining the fetch path (if the cache read time is
larger than a cycle, wave pipelining [7] can be used to provide a cache
throughput of one read per cycle as long as accurate addresses can be provided
at the same rate). In a 1GHz 0.18µ cpu, the first level cache must be reduced to
8KB to allow a single cycle read (as in the Pentium III [12]; the Pentium 4 [13]
and the AMD Athlon [1] have larger L1 caches accessed in 2 cycles).
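As a rough illustration of how Table 1 can be read against a cycle time, the sketch below picks the largest cache whose data-side access time fits in one cycle. It uses the 0.18µ row of Table 1 and a 1 ns cycle (1GHz). The function name is hypothetical, and the model deliberately ignores instruction selection and slotting time, so it only approximates the argument made in the text.

```python
# Data-side access times (ns) from the 0.18 micron row of Table 1,
# listed from the smallest to the largest cache.
ROW_018_NS = {
    "1KB": 0.50, "4KB": 0.59, "8KB": 0.65, "16KB": 1.04,
    "64KB": 1.54, "256KB": 2.29, "1MB": 4.08,
}

def largest_single_cycle_cache(row_ns, cycle_ns):
    """Return the largest cache size readable within one cycle, or None.

    Hypothetical helper: it compares the raw access time to the cycle
    time and nothing else (no slotting, no drive time).
    """
    fitting = [size for size, t in row_ns.items() if t <= cycle_ns]
    return fitting[-1] if fitting else None

# 1GHz clock, i.e. a 1 ns cycle, as in the Pentium III example.
best_at_1ghz = largest_single_cycle_cache(ROW_018_NS, 1.0)
print(best_at_1ghz)
```

With these inputs the 16KB entry (1.04 ns) just misses the 1 ns cycle, which is consistent with the 8KB first level cache quoted for a 1GHz 0.18µ cpu.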

Moreover, driving the fetched instructions to the computing units is also a
concern in today’s cpus. For example, the Pentium 4 dedicates a full cycle to
this purpose. Tomorrow, in 2003/2005, in a 3GHz cpu including an 8KB L1
instruction cache, if the path connecting the fetch machine to the computing
units is not shortened, the drive time on its own will take two full cycles (to
be added to the two cache read cycles).

This paper gives two hints to reduce fetch and load/store read path lengths 
in order to cope with the increased drive time. This paper is linked to the various 
works concerning processor clustering [6,11]. The main difference between the 
present approach and previous proposals is that we modify the ISA (Instruction 
Set Architecture). The consequence is that clustering is not only performed on 
the set of functional units but also on the fetch path and on the load/store path. 

The paper is organized as follows: the next section presents a two-level cache
organization allowing simultaneous accesses to L1 and L2. Section 3 describes a
new die layout clustered by data types. Section 4 explains the necessary ISA
modifications. Section 5 discusses the proposal and compares it to the known
works on cpu clustering, multithreading and on-chip multiprocessors.

2 A Two Levels Simultaneously Accessed Cache 

On the die example presented on Fig. 1 left part, the fetch path is labeled ‘f’.
The fetch critical path starts from the center of the fetch unit (labeled icache
on the figure) left edge, where the pc is assumed to be located. It travels
(say) to the fetch unit upper right corner, where the fetched block is assumed
to reside, and the fetched graphical instructions are then propagated to the
graphic unit (labeled gu) entrance, which is assumed to be at the center of the
unit left edge. This 5l/3 path, where l is the die edge length (l/2 + l/3 +
2l/3 + l/6), corresponds to a worst case, a best one having an l/3 length
(integer instruction placed at the center of the cache area). In a cpu today,
this work is performed in two cycles, the first one being devoted to cache read
(length 4l/3) and the second one to instruction slotting (length l/3 and some
decoding and selection logic). In such a cpu, the cycle is fixed to leave enough
time for cache read, i.e. for a signal to cross a length 4l/3 distance. Because
the die edge l is expected to increase ([3]; 20mm in 2000, 28mm in 2005 and 40mm
in 2010), as long as the cache read path is linked to l, the cycle time, if
equal to the cache read time, should increase too (in other words, if we use the
edge length increase to fill the die with cache, we increase the cycle time).

Fig. 1. Two processor dies (each with die edge l)

In the case of a memory access, an inter-computing-unit communication takes
place. The address computation is performed in the integer unit while the
access itself may concern a register of another unit. The load path is labeled
‘m’ on the figure. It is longer than the fetch path because it involves a round
trip. On the die example, the worst load distance is 2l, starting at the center
of the integer unit right edge (load address), going up to the data cache upper
right corner where the block to be loaded is assumed to reside and going back to
the center of the loading computing unit left edge, which is assumed to be the
graphic one in the worst case (l/2 + l/3 + 2l/3 + l/6 + ...). With today’s
signalling delays, loading has at least a two-cycle latency.

A first attempt to shorten the fetch and the load/store paths is to separate
each cache into two levels. The die can be organized as shown on Fig. 1 right
part. It includes a core corresponding to the left die, surrounded by two level
2 caches. The die edge is l (scaled to the technology process) and the core edge
is l', with a wire delay of 1 cycle for a 4l'/3 length (we link the cycle time
to the l' length, which should decrease as technology is refined). Such a design
is close to the DEC 21164 or the Pentium III and 4 (except that the L2 cache in
these processors is a unified cache). This two-level cache organization with a
small L1 cache was abandoned by DEC because the L1 cache had too high a miss
rate and the first level miss detection added too high a penalty to the second
level access time. It is still in use in the Pentium III but has also been
abandoned in both the AMD Athlon and the Pentium 4, with a 64KB cache for the
former and a 12Kops trace cache for the latter (the trace cache contains fixed
length micro operations; it has a capacity roughly equivalent to a 64KB cache).

But as Table 1 shows, such a cache capacity at 2003 clock speeds will lead to a
three to five cycle read latency (2 to 3 GHz). On the other hand, a cost
effective two-level hierarchy starting with a small capacity L1 cache requires
a reduced L2 access time. This can be done by performing both accesses (L1 and
L2) simultaneously (the L2 access is stopped if the L1 hits), which implies
keeping the L2 instruction cache distinct from the L2 data cache to avoid
contention. In such a design, the L2 instruction cache can preferably be a
victim cache as in the AMD Athlon (in case of a miss, the L1 cache loads the
missing line and transfers to L2 the replaced line). In such a way, L1 miss
detection does not add any penalty to L2 access time. In this design, a fetch
that hits in the first level crosses a length 5l'/3, with 4l'/3 in a single
cycle. For an L2 fetch, the distance is 2l - l'/3 (2l - l' for L2 access and
2l'/3 to propagate the L2 block to (say) the fpu). If we assume l = 2l' as drawn
on the figure, the L2 read (3l') can be performed in 2 cycles (one more cycle is
needed to slot instructions in their units, as in the case of a first level
hit). As l' is decreased and l is increased, L2 reads may last much more than
two cycles in finer technologies. When the gap becomes too large, it may be
necessary to separate L2 into two levels, giving three levels on-chip. However,
by keeping the L1 caches constant in capacity, the l' length can scale with the
technology, allowing the cycle to scale too.
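The benefit of starting the L2 access at the same time as the L1 access can be illustrated with a simple average-latency model. This is only a sketch: the function name, the 1-cycle L1 read and the 8% miss rate are assumptions chosen for illustration, while the 2-cycle L2 read is the figure quoted above for l = 2l'.

```python
def average_read_cycles(miss_rate, l1_cycles, l2_cycles, simultaneous):
    """Average read latency of a two-level hierarchy, in cycles.

    Sequential lookup pays the L1 miss detection before the L2 access
    starts; simultaneous lookup starts both at once, so a miss costs
    only the L2 read time.
    """
    if simultaneous:
        miss_cost = l2_cycles
    else:
        miss_cost = l1_cycles + l2_cycles
    return (1.0 - miss_rate) * l1_cycles + miss_rate * miss_cost

# Assumed illustration values: 8% L1 miss rate, 1-cycle L1, 2-cycle L2.
sequential = average_read_cycles(0.08, 1, 2, simultaneous=False)
parallel = average_read_cycles(0.08, 1, 2, simultaneous=True)
print(f"sequential: {sequential:.2f} cycles, simultaneous: {parallel:.2f} cycles")
```

The gap between the two figures grows with the miss rate, which is why the simultaneous access matters most when the L1 cache is kept small.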

This is true only if the core does not increase (l' is not increased for
architectural reasons). Unfortunately, the current trend with higher superscalar
degrees is to add more and more functional units, along with new specialized
ones such as SIMD operators and bigger branch prediction hardware. Hence l'
will not remain constant, implying that signal propagation inside the core
itself may become a problem.

3 A New Die Layout 

Figure 2 shows an alternative die organization that has three main differences
from the layouts on Fig. 1: the first level caches have been duplicated, each
copy being local to a computing unit; the fetch subunit and the prediction
subunit have been duplicated, with each copy placed on the left edge of each L1
instruction cache (there are three pcs); finally, the branch unit has been
duplicated in every computing unit (the duplications are not apparent on the
figure except for the L1 caches).

The ‘f’ path starts from the center of an L1 unit left edge, goes up to the
same L1 unit upper right corner and ends at the center of the corresponding
computing unit left edge. The length is 2l'/3, due to the locality of the fetch
(l'/6 + l'/3 + l'/6). This is less than 1/2 of the Fig. 1 right die ‘f’ path
length.

1,2,3: fpu,iu,gu instruction cache level 1
4,5,6: fpu,iu,gu data cache level 1

Fig. 2. A new die layout



In case of an L1 miss, the ‘f’ path starts from the center of the L1 (say)
graphic unit (gu) left edge, travels up to the L2 upper left corner (say), comes
back to the starting point and ends at the center of the graphic computing unit
left edge. Hence, it has length 2l (2 * (l'/3 + l/2 + (l - l')/2) + l'/3), which
is l'/3 longer than the Fig. 1 right die ‘f’ path (the difference comes from the
starting point of the fetch, which is equidistant from each corner of the right
die on Fig. 1 but not on Fig. 2).

The ‘m’ path starts from the center of any computing unit right edge, goes
to the corresponding L1 data unit upper right corner and travels back to the
center of the same computing unit left edge (l'/3 + l'/6 + 2l'/3 + l'/6). This
is a length 4l'/3 path (1/3 less than the right die ‘m’ path, which has length
2l'). In case of an L1 miss, the L2 access path has length 2l + 2l'/3
(2 * (l'/3 + (l - l')/2 + l'/3 + l/2) + l'/3). This is 2l'/3 longer than the
right die ‘m’ path. The reason is that the address computation is local to the
loading unit in the die on Fig. 2 instead of being performed in the integer unit
in the right die on Fig. 1.
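The path lengths of this section are linear combinations of l and l', so their sums can be checked mechanically. The sketch below stores a length a*l + b*l' as the pair (a, b) with exact fractions. The helper names are hypothetical, and the segment decompositions are the ones read from the text for the Fig. 2 die, namely l'/6 + l'/3 + l'/6 for a fetch hit and 2 * (l'/3 + l/2 + (l - l')/2) + l'/3 for a fetch miss, with the corresponding sums for the ‘m’ path.

```python
from fractions import Fraction as F

def in_l(q):
    """The length q * l, stored as (coefficient of l, coefficient of l')."""
    return (F(q), F(0))

def in_lp(q):
    """The length q * l'."""
    return (F(0), F(q))

def add(*terms):
    """Sum of several lengths."""
    return (sum(t[0] for t in terms), sum(t[1] for t in terms))

def scale(k, term):
    """A length multiplied by the scalar k (used for round trips)."""
    return (k * term[0], k * term[1])

half_gap = add(in_l(F(1, 2)), in_lp(F(-1, 2)))   # (l - l')/2

# Fetch path: L1 hit, then L1 miss served by L2.
f_hit = add(in_lp(F(1, 6)), in_lp(F(1, 3)), in_lp(F(1, 6)))
f_miss = add(scale(2, add(in_lp(F(1, 3)), in_l(F(1, 2)), half_gap)),
             in_lp(F(1, 3)))

# Load path: L1 hit, then L1 miss served by L2.
m_hit = add(in_lp(F(1, 3)), in_lp(F(1, 6)), in_lp(F(2, 3)), in_lp(F(1, 6)))
m_miss = add(scale(2, add(in_lp(F(1, 3)), half_gap, in_lp(F(1, 3)),
                          in_l(F(1, 2)))),
             in_lp(F(1, 3)))

for name, value in [("f hit", f_hit), ("f miss", f_miss),
                    ("m hit", m_hit), ("m miss", m_miss)]:
    print(f"{name}: {value[0]} * l + {value[1]} * l'")
```

The four results are 2l'/3, 2l, 4l'/3 and 2l + 2l'/3, matching the lengths quoted for the new layout.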

To fairly compare the two layouts, we must take into account that the L1
caches are partitioned in the new layout. This means that each unit has a
private part that is 1/3 of the total L1 cache size. The cache miss rate is
affected by this capacity reduction. Let r1 be the L1 miss rate on the Fig. 1
right die and r2 the L1 miss rate on the Fig. 2 die. The average fetch length on
the Fig. 1 right die is 5l'/3 * (1 - r1) + (2l - l'/3) * r1 and the average
fetch length on the Fig. 2 die is 2l'/3 * (1 - r2) + 2l * r2. The latter is
better than the former as soon as l < (3 - 6r1 - 2r2)/(6(r2 - r1)) l'. If for
example r1 = 0.08 and r2 = 0.16, the new layout is better than the old one if
l < 4.6l' (when l gets very large compared to l', the L2 time becomes dominant;
in such a case it is necessary to insert an intermediate level between L1 and
L2; like L1 on Fig. 2, this intermediate level can be clustered).
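The two average fetch lengths can be evaluated directly for a given die geometry. The following sketch uses the example miss rates r1 = 0.08 and r2 = 0.16 and takes l' as the unit of length. The function names and the two sample ratios l/l' (2, as drawn on Fig. 1, and 10, a deliberately large die) are assumptions made for illustration.

```python
def avg_fetch_fig1_right(l, lp, r1):
    """Average fetch length on the Fig. 1 right die.

    A hit crosses 5l'/3, a miss crosses 2l - l'/3.
    """
    return (5 * lp / 3) * (1 - r1) + (2 * l - lp / 3) * r1

def avg_fetch_fig2(l, lp, r2):
    """Average fetch length on the Fig. 2 die.

    A hit crosses 2l'/3, a miss crosses 2l.
    """
    return (2 * lp / 3) * (1 - r2) + (2 * l) * r2

R1, R2 = 0.08, 0.16   # example miss rates from the text
LP = 1.0              # l' taken as the unit of length

for ratio in (2.0, 10.0):
    old = avg_fetch_fig1_right(ratio * LP, LP, R1)
    new = avg_fetch_fig2(ratio * LP, LP, R2)
    better = "new layout" if new < old else "old layout"
    print(f"l = {ratio:g} l': old {old:.3f}, new {new:.3f} -> {better}")
```

For the small ratio the clustered layout wins despite its doubled miss rate, while for the large ratio the L2 time dominates and the advantage is lost, which is the situation where an intermediate cache level becomes necessary.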

The two layouts must also be compared for inter-unit communications. An
inter-unit communication is involved when a conditional branch or indirect jump
has not been correctly predicted. In this case, the corrected continuing fetch
address is transferred from the branch unit to the prediction subunit (the
branch unit can either be a separate unit or, as assumed on Fig. 1, part of the
integer unit). The prediction correction critical path, labeled ‘c’ on Fig. 1,
can be assumed to be of length 2l/3 (2l'/3 on the right die), going from the
center of the integer unit right edge to the prediction subunit, located near
the pc register, i.e. at the center of the fetch unit left edge. On Fig. 2, the
‘c’ path runs from the center of any computing unit right edge to the center of
the corresponding L1 unit left edge. The path has a length of l'/3, which is
half of the path on the Fig. 1 right die.

Also concerning the branch unit, the new set of predicated instructions [9]
can lead to predicate inter-unit communications (e.g. a predicate obtained from
the comparison of two floating point numbers and used in the branch unit).
The predicate propagation critical path, labeled ‘p’ on Fig. 1, can be assumed
to have length 2l/3 (2l'/3 on the right die), starting from the center of the
(say) graphic unit right edge and ending at the center of the integer unit left
edge (l/6 + l/3 + l/6). Because the branch unit has been duplicated on the
Fig. 2 die, no inter-unit predicate propagation is needed and the ‘p’ path may
be removed.



4 ISA Typing 

To allow a full separation of the computing machines, it is necessary to give
each of them some facilities for computing instruction addresses (branch and
jump targets) as well as data addresses (pointers to data structures). In such
a way, a function can be fully handled by its hosting unit. For example, a
floating point function such as SAXPY, which is made of floating point
computation instructions, load and store instructions and loop control
instructions, is entirely handled by the fpu. This means that the fpu register
file keeps the floating point array pointers and indexes as well as the
floating point data. It also implies that the set of functional units includes
an integer adder to handle address computations. The instruction set itself is
enhanced to give each computing machine its own branching, jumping, loading and
storing instructions (with an address computation based on pointers locally
available in the machine register file).

The following piece of code shows what SAXPY looks like when written with
such an instruction set. As we can see, every instruction is prefixed with ‘f’.
This designates the fpu as the computing unit used to compute the function. The
opcode suffix gives the operation to be performed, which can be a true floating
point one as in ‘fmul’ and ‘fadd’, a load or a store of a floating point datum
(‘fld’ and ‘fst’), the computation of an integer value (‘ficlr’ and ‘fiadd’) or
a comparison of integers in a conditional branch (‘fibtrue’). The return
instruction itself is typed, to distinguish a floating point function return
from an integer one. The call instruction is also typed, with the prefix giving
both the caller and the callee types (for example, the fpu has three calls:
‘ficall’, ‘ffcall’ and ‘fgcall’ to call respectively an integer, a floating
point or a graphic function). To handle calls and returns as well as
predictions, two hardware stacks per unit are maintained. One, say RS, is the
return address stack (as in current speculative processors) and the other, say
CS, is a continuing unit stack.



saxpy:                              /* x:f1; y:f2; a:f3; n:f4 */
                                    /* for (i=0; i<n; i++)    */
                                    /*     y[i]+=a*x[i];      */
        ficlr   f16                 /* i=0 */
e0:     fld     f17=M[f1+f16*4]     /* x[i] */
        fmul    f17=f17*f3          /* a*x[i] */
        fld     f18=M[f2+f16*4]     /* y[i] */
        fadd    f17=f17+f18         /* y[i]+=a*x[i] */
        fst     M[f2+f16*4]=f17     /* y[i] */
        fiadd   f16=f16+1           /* i++ */
        fibtrue (f16!=f4),e0        /* loop */
        fret
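To make the register usage of the typed listing easier to follow, here is a line-by-line Python transcription. It is only an illustrative model, not part of the proposed ISA: Python lists stand in for memory, the local variables f16 to f18 stand in for fpu registers, and the function name is hypothetical. Like the listing, the loop tests its condition at the end, so it assumes n is at least 1.

```python
def saxpy_typed(x, y, a, n):
    """Mirror of the typed SAXPY listing; every value stays in 'fpu registers'.

    x, y play the role of the arrays pointed to by f1 and f2,
    a is f3 and n is f4.
    """
    f16 = 0                    # ficlr   f16               i = 0
    while True:                # e0:
        f17 = x[f16]           # fld     f17=M[f1+f16*4]   x[i]
        f17 = f17 * a          # fmul    f17=f17*f3        a*x[i]
        f18 = y[f16]           # fld     f18=M[f2+f16*4]   y[i]
        f17 = f17 + f18        # fadd    f17=f17+f18       y[i]+a*x[i]
        y[f16] = f17           # fst     M[f2+f16*4]=f17   store y[i]
        f16 = f16 + 1          # fiadd   f16=f16+1         i++
        if not (f16 != n):     # fibtrue (f16!=f4),e0      loop while i != n
            break
    return y                   # fret

result = saxpy_typed([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], 2.0, 3)
print(result)
```

The index f16 and the array bases never leave the fpu register set in this model, which is the point of giving the fpu its own integer adder and its own load, store and branch instructions.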



A unit is made active when a call of its type is fetched (see Fig. 3 upper
part; for example, if the fpu fetches a ‘ficall’, the integer unit is made
active). The identifier of the previously active unit is pushed on top of CS
(in the example, the fpu identifier is pushed on top of the integer unit CS
stack; the return address is pushed on top of the fpu RS stack). The same unit
remains active until the return instruction is fetched (in the example, a
return from an integer function; in a return instruction, the target function
type is not mentioned, to allow returns to different function types). While a
unit is active, the other ones are inactive (hence, units read one at a time in
the instruction and data L2 caches). When a return is executed, the leaving
unit pops the continuing unit from its CS stack. The leaving unit is
deactivated (its fetch unit stops fetching) and the continuing unit is
reactivated (its fetch unit pops the return address from its local prediction
stack RS and restarts fetching). Deactivation and reactivation are performed
within the same cycle, incurring no delay to switch the active unit.
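The call and return protocol described above can be modelled with two stacks per unit. The sketch below is a behavioural illustration only: the class and method names are hypothetical, addresses are plain strings, and prediction, misprediction recovery and timing are left out.

```python
class TypedCore:
    """Toy model of unit activation with per-unit CS and RS stacks."""

    def __init__(self, units=("fpu", "iu", "gu"), start="fpu"):
        self.cs = {u: [] for u in units}   # continuing-unit stacks
        self.rs = {u: [] for u in units}   # return-address stacks
        self.active = start                # the only unit that fetches
        self.pc = None

    def call(self, callee_unit, target, return_addr):
        """Typed call fetched by the active unit (e.g. 'ficall' in the fpu)."""
        caller = self.active
        self.cs[callee_unit].append(caller)   # push caller on CS(callee)
        self.rs[caller].append(return_addr)   # push return address on RS(caller)
        self.active = callee_unit             # callee becomes active
        self.pc = target

    def ret(self):
        """Typed return: the leaving unit hands control back."""
        leaving = self.active
        continuing = self.cs[leaving].pop()   # pop the continuing unit
        self.active = continuing              # reactivate it
        self.pc = self.rs[continuing].pop()   # it pops its own return address

core = TypedCore()
core.call("iu", target="max", return_addr="a")   # fpu fetches 'ficall max'
unit_during_call = core.active
core.ret()                                       # iu fetches 'iret'
print(unit_during_call, core.active, core.pc)
```

The return needs no target type because the continuing unit is read from the CS stack of the leaving unit, which matches the remark that a return instruction does not mention the target function type.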

Branches and jumps are predicted in the same way as in a speculative processor
(with one prediction subunit per type). The correction of a bad prediction is
internal to a unit except for calls and returns. When a call is mispredicted,
the calling unit resets the called one with the corrected address. When a return
is mispredicted, first the falsely predicted continuing unit is deactivated and
second, the true return address designates the true continuing unit (such a
distinction requires that the OS separates the code region into typed segments;
an alternative is to trap at fetch when a type t' instruction is fetched by a
type t unit; this is further explained later). Such a typed coding relies
heavily on the compiler’s ability to recognize each function type. For example,
in C, the SAXPY function would have the following prototype:

    void saxpy(float x[], float y[], float a, int n)

This example shows that the function type itself (void) does not give the
computing unit type. Moreover, the set of arguments is mixed. However, in this
example the compiler should easily decide to match the function with the fpu.
Some other cases might be more complicated, for example when the function
computes separate values of different types. Then, the compiler has to analyse
the data dependencies to separate the function code into properly typed
subfunctions. The connection between the subfunctions can be realized with a
typed call instruction or by trapping. With the trapping technique, when a unit
u (see Fig. 3 lower part) fetches an instruction belonging to unit u', a trap
occurs that switches the active unit from u to u' (as if a typed call had been
fetched). This kind of trap occurs at fetch time. It incurs a one-cycle delay
(a block is fetched and a badly typed instruction is detected in it; this
deactivates the active unit (u) and reactivates the continuing unit (u') with
the trapping instruction address for pc; the instruction is refetched by the
unit of its type (u')). The compiler should reorganize the code to minimize such
type breaks (we must insist on the fact that mixing pointers and data does not
lead to type breaks; as in the SAXPY example above, we can see that floating
point pointers and indexes are handled by the fpu with its register file; for
this reason, type breaks should be rare in the dynamic code run from our
proposed ISA; as an example, the SAXPY code does not contain any type break).

        ...              ;fpu is the active unit
        ficall max       ;push fpu on top of CS(iu)
                         ;push "a" on top of RS(fpu)
a:      ...              ;fpu is the active unit
        ...              ;fpu is the active unit

max:    ...              ;iu is the active unit
        ...              ;iu is the active unit
        iret             ;pop CS(iu); pop RS(fpu)

        u...             ;u is the active unit
        u...             ;u is the active unit
        u...             ;u is the active unit
        u'...            ;trap; u' becomes the active unit
        u'...            ;u' is the active unit
        u'...            ;u' is the active unit
        u'...            ;u' is the active unit
        u...             ;trap; u becomes the active unit
        u...             ;u is the active unit

Fig. 3. Typed function call and return
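The fetch-time trap can be illustrated by scanning an opcode stream and switching the active unit whenever the type prefix of the fetched opcode does not match. This is a sketch under simplifying assumptions: the first letter of an opcode is taken as its unit type, typed calls and returns are not modelled, and ‘iadd’ is a made-up integer opcode used only for the mixed example.

```python
UNIT_OF_PREFIX = {"f": "fpu", "i": "iu", "g": "gu"}

def count_type_breaks(opcodes, start_unit):
    """Return (number of fetch traps, active unit after each instruction).

    Each trap stands for the one-cycle delay of switching the active
    unit and refetching the instruction in the unit of its type.
    """
    active = start_unit
    traps = 0
    trace = []
    for op in opcodes:
        owner = UNIT_OF_PREFIX[op[0]]
        if owner != active:        # badly typed instruction detected at fetch
            traps += 1
            active = owner         # the unit of the instruction takes over
        trace.append(active)
    return traps, trace

SAXPY_OPS = ["ficlr", "fld", "fmul", "fld", "fadd", "fst",
             "fiadd", "fibtrue", "fret"]
saxpy_traps, _ = count_type_breaks(SAXPY_OPS, "fpu")

# Hypothetical mixed stream: two integer instructions inside fpu code.
mixed_traps, mixed_trace = count_type_breaks(
    ["fadd", "iadd", "iadd", "fmul"], "fpu")
print(saxpy_traps, mixed_traps, mixed_trace)
```

The SAXPY stream produces no trap because its index and address arithmetic use fpu-typed opcodes, whereas the mixed stream pays one trap on entering and one on leaving the integer instructions.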

The case of data is different. A function accesses data memory through its
unit private cache. For example, a function handled by the fpu accesses floating
point data and pointers to floating point structures in the fpu data cache. If a
datum must be shared by two units, an explicit conversion must be performed,
coded with an import instruction that moves a register from an external unit to
a register of the local unit. For example, this is the case in a conversion
function. The function (see the example on Fig. 4 with the function atof)
computes the conversion and leaves the converted value in one of its data
registers.

atof:   ...                 ;read an ascii word
        ...                 ;and convert to a
        ...                 ;floating point data
        ...                 ;atof is computed by
        ...                 ;the integer unit
        iret                ;fp data in R0, integer register

fpfun:  ...
        ficall  atof        ;call to the integer function atof
        fimport F0=R0       ;get the fp data
                            ;(inter-register files transfer)

Fig. 4. Inter-units importation



            iu data cache              fpu data cache

line 1      | i | f |                  | i | f |

            i may be updated           i may not be updated
            f may not be updated       f may be updated

line 1 is duplicated in two caches, referenced for i by iu and for f by fpu

Fig. 5. A duplicated cache line



The caller (function fpfun) imports this register into its own register file
(instruction fimport). The computing units are linked by a bus that only serves
this purpose. In such a way, a datum is typed and never has to be loaded for
update in the cache of another type (it can be loaded as a side effect with
other data sharing the same cache line in another cache but cannot be modified;
for example, a structure composed of an integer and a floating point datum can
be loaded in the integer unit data cache as well as in the fpu one; however, the
integer, if handled by the integer unit, cannot be modified by the fpu, nor can
the floating point datum be modified by the integer unit (see Fig. 5)). These
restrictions ensure cache coherency without the need for snooping.
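The update restriction on a duplicated line can be sketched as an ownership check performed on every store. The model below is hypothetical in all its names; it only illustrates the rule that each field of a line has one owning unit type, so that the copies held by other units are never written for that field.

```python
class TypedField:
    """One datum of a cache line together with the unit type that owns it."""
    def __init__(self, owner, value):
        self.owner = owner
        self.value = value

class UnitCache:
    """Private L1 data cache of one unit, holding its own copy of a line."""
    def __init__(self, unit, line):
        self.unit = unit
        # Each cache gets a private copy of every field of the line.
        self.line = {name: TypedField(f.owner, f.value)
                     for name, f in line.items()}

    def load(self, name):
        return self.line[name].value

    def store(self, name, value):
        field = self.line[name]
        if field.owner != self.unit:
            raise PermissionError(
                f"{self.unit} may not update {name!r}, owned by {field.owner}")
        field.value = value

# A structure made of an integer i and a floating point datum f (Fig. 5).
line1 = {"i": TypedField("iu", 7), "f": TypedField("fpu", 1.5)}
iu_cache = UnitCache("iu", line1)
fpu_cache = UnitCache("fpu", line1)

iu_cache.store("i", 8)       # allowed: i is owned by the integer unit
fpu_cache.store("f", 2.5)    # allowed: f is owned by the fpu
```

Because each field is only ever written through the cache of its owning unit, the stale copy of that field in the other cache is never the one that gets written back, which is why no snooping is required in this scheme.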

This typing of the computing units is compatible with any speculative and 
out-of-order execution mode as long as the techniques employed for implemen- 
tation do not add any interunit communication except possibly some status 
signals. 



5 Discussion 

Different recent proposals have an impact on processor clustering. SMT-like
multithreaded processors [15] have a very centralized organization, the
different threads being merged into a unique shared core. This is an important
drawback of SMT design that should lead to significant inter-unit communication
delays. On-chip multiprocessors [10] cluster the die into independent processors
that can communicate only through the shared part of their memory hierarchy
(which can be on-chip or off-chip). The main disadvantage of this type of design
comes from the resource redundancy imposed by the share-nothing choice and the
inter-processor communications induced by the coherency protocol for the
duplicated levels of the data memory hierarchy.

The Multicluster architecture [6], the Multiscalar architecture [14] and the
Complexity-effective architecture [11] are examples of clustered single
processor architectures. In each of them, the fetched instructions are
dispatched to clusters. This is a first difference with this proposal. In these
designs clustering does not concern the fetch unit, implying that the fetch
latency is not improved. A second difference is that we modify the ISA and they
do not. For this reason, the cluster to which an instruction belongs is not
opcode dependent (as it is in the present proposal) but is determined from two
contradictory aims: avoiding inter-cluster communications (this tends to
centralize the instructions into a single cluster) and balancing the cluster
computation load (this tends to distribute instructions uniformly among
clusters). Instructions forming a computation chain are steered to the same
cluster. Different chains are alternately placed in the available clusters.
Instructions may have their sources in two different clusters, implying an
inter-cluster communication. This occurs often enough to impose one or more
inter-cluster communication paths.

In fact their goal is not the same as ours. They want to reduce the core
resources needed to allow a high superscalar degree (such as read and write
ports on the register file). Each cluster handles a subset of the superscalar
degree (for example, a degree 16 machine could be composed of 4 degree 4
clusters). What we want is to decrease the number of inter-unit communications
and to reduce the path lengths. If a high superscalar degree is needed, then
each computation unit has to be clustered itself (in the proposed design, each
unit, when it is active, fetches as many instructions as the superscalar degree
allows), which is out of the scope of this paper.



6 Conclusion 

In this paper, we have first pointed out that inter-unit communication paths,
and mainly fetch and load/store paths, now impact processor performance. The
fetch and the load/store paths can be shortened with a reduced size L1 cache.
The L1 miss penalty can be reduced if the L1 and L2 caches are simultaneously
accessed. We have also noted that even though the L1 cache is kept small to
scale with the cpu cycle, the core area increases, leading to an increase in the
drive time along the paths connecting the caches to the computing units.

By typing the ISA, we have shown that it is possible to halve the fetch path
length. The instruction cache, the fetch unit, the branch prediction unit, the
typed computing unit, the load/store unit and the data cache together form an
independent machine that has almost no link with the other computing machines
in the cpu, except inter-register import. During a cycle, a single machine is
the active machine, i.e. the one that fetches. A machine is made active by a
function call of its type, a function return or a fetch trap.

ISA typing should help to scale the processor cycle with the technology in
future designs. With ISA typing, it is possible to reduce the pipeline depths
by removing some or all of the drive stages. This means that the CPI can be
better sustained despite the dramatic cycle reduction.

Abstract. During the software crisis of the 1960s, Dijkstra’s famous thesis
“goto considered harmful” paved the way for structured programming,
i.e. software development with well-defined and disciplined organization
of control flow. In parallel programming, a new aspect - communication
- has an important impact on the structure and properties of programs.
This paper shows that many current difficulties of parallel programming
are caused by complicated and poorly structured communication, which
is a consequence of using low-level send-recv primitives. We argue that,
like goto in sequential programs, send-recv should be avoided as far as
possible and replaced by collective operations in the parallel setting.
We argue against some widely held opinions about the apparent superiority
of individual over collective communication and present substantial
theoretical and empirical evidence to the contrary. The paper overviews
some recent results on formal transformation rules for collective operations
that facilitate systematic, performance-oriented design of parallel
programs using MPI (Message Passing Interface).



1 Introduction 

Nowadays, parallel and distributed systems have apparently ideal conditions for
their development. The demand for such systems is great and growing steadily.
Traditional supercomputing applications, Grand Challenges, require the solution
of increasingly large problems, with new areas added recently, e.g. research on the
human genome. The rapid growth of the Internet has given rise to geographically
distributed, networked supercomputers (Grids) and to new classes of distributed
commercial applications with parallelism on both the server and client side.

Every year, bigger and more powerful systems are built. Microprocessors 
are quickly becoming faster and cheaper, which enables more processors to be 
connected in one system. New networking hardware with smaller latency and 
greater bandwidth improves systems’ scalability. Several levels of parallelism are 
available to the user: within a processor, between processors in an SMP or cluster, 
up to the parallelism among remote machines cooperating over the Internet. 

Under such a favourable combination of conditions - strong demand and
good hardware availability - it would be natural to expect substantial progress
in the field of parallel and distributed software. However, program development
for parallel and distributed systems remains a challenging and difficult task.

One of the obvious reasons for this unsatisfactory situation is that today’s
programmers rely mostly on the programming culture of the 1980s and ’90s, the
Message Passing Interface (MPI) still being the programming tool of choice for
demanding applications. The main merit of MPI is that it integrated and
standardized major well-understood parallel constructs that were proven in
practice. This put an end to the unacceptable situation where every hardware
vendor provided its own set of communication primitives.

The major disadvantage of MPI - low-level communication management with 
the primitives send and recv resulting in a complicated programming process - 
has been known and criticized for years. Several attempts have been made to 
overcome this, DSM, HPF and OpenMP being the most prominent proposals. 
However, despite reported success stories, none of these approaches have ever 
achieved the popularity of MPI. 

We believe that although MPI’s main problem - low-level communication - 
was identified correctly, the chosen remedy - a complete banning of explicit 
communication statements from parallel programs - was probably not the right 
one. While simplifying the programming process, it makes the performance of 
parallel programs less understandable and hardly predictable. 

The thrust of this paper is: the problems of low-level communication should
be solved not by excluding communication from parallel programs altogether, but
rather by expressing communication in a structured way.

2 Learning From History: “Goto Considered Harmful” 

To decide what would be a better, more structured way of dealing with
communication in parallel programs, let us turn to the history of “structured
programming” in the sequential setting. During the 1960s, it became clear that
the indiscriminate use of transfers of control was the root of much of the difficulty
experienced by software developers. The breakthrough was made by Dijkstra in
his famous letter “Goto considered harmful” [9], where the finger of blame was
pointed at the goto statement. The notion of so-called structured programming
[7] became almost synonymous with “goto elimination”.

Dijkstra’s thesis did not appear in a vacuum. By that time, the research
of Böhm and Jacopini [6] had formally demonstrated that programs could be
written without any goto statements, in terms of only three control structures -
sequence, selection and repetition. It was not until the 1970s that programmers
started taking structured programming seriously, but even the first results were
impressive, with software development groups reporting reduced development
times as well as more frequent on-time and within-budget completion of software
projects. The key to success was that structured programs are clearer, easier to
debug and modify, and more likely to be bug-free. Newer languages like Java do
not have a goto statement at all.

If we wish to learn from structured (sequential) programming, we have to
answer the question: which concept or construct plays a negative role - similar
to that of the goto - in the parallel setting?

Fig. 1. Just as the indiscriminate use of the goto complicates sequential programs,
send-recv statements cause major difficulties in parallel programming.



As implied in Figure 1 and demonstrated from Section 4 onwards, we believe 
that send-recv statements cause problems in parallel programming. We sug- 
gest, therefore, that send-recv be “considered harmful” and be avoided as far as 
possible in parallel programs. 

3 Collective Operations: An Alternative to Send-Recv? 

What would be the proper replacement for send-recv? In our opinion, it does 
not even need to be invented: we propose using collective operations, which are 
already an established part of MPI and other communication libraries. Each 
collective operation is a particular pattern specifying a mutual activity of a group 
of processes, like broadcasting data from one process to all others, gathering 
information from all processes in one process, and so on. 

First prototypes of collective operations have been used since the 1970s. 
Languages and libraries like Minimax [19], CCL [2], PVM [10] definitely do not 
constitute an exhaustive list of such approaches. It was one of the main merits of 
the MPI standard that it combined in a uniform manner practically all collective 
operations that have been known and used for years. 

For the sake of completeness, we show in Figure 2 the main collective
operations of MPI for a group of four processes, P1 to P4.

The two upper rows of Figure 2 contain collective operations that specify pure
communication (e.g. broadcast, gather, etc.); operations at the bottom of the 
figure, like reduce, perform both communication and computation. The binary 
operator specifying computations (+ in Figure 2) is a parameter of the collec- 
tive operation: it may be either predefined, like addition, or user-defined. If the 
operator is associative, the collective operation can be implemented in parallel. 
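The role of associativity can be illustrated with a minimal sketch that simulates the processes as array slots in plain C, so that it runs without MPI. All names (binop, add, imax, reduce_seq, reduce_tree) and the limit of 64 simulated processes are assumptions made for this illustration. Within one round of reduce_tree, all combinations are independent of each other, which is what would allow a real implementation to perform them simultaneously.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*binop)(int, int);

/* Two associative example operators. */
int add(int a, int b)  { return a + b; }
int imax(int a, int b) { return a > b ? a : b; }

/* Sequential reduction: ((x0 op x1) op x2) op ... */
int reduce_seq(const int *x, size_t n, binop op) {
    assert(n > 0);
    int acc = x[0];
    for (size_t i = 1; i < n; i++)
        acc = op(acc, x[i]);
    return acc;
}

/* Tree-shaped reduction over simulated processes. In each round, slot i
   combines its value with that of slot i + stride. The left-to-right order
   of operands is preserved, so associativity alone (without commutativity)
   guarantees the same result as reduce_seq. */
int reduce_tree(const int *x, size_t n, binop op) {
    assert(n > 0 && n <= 64);
    int v[64];
    for (size_t i = 0; i < n; i++)
        v[i] = x[i];
    for (size_t stride = 1; stride < n; stride *= 2)      /* one round */
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            v[i] = op(v[i], v[i + stride]);               /* independent steps */
    return v[0];
}
```

For n values the tree version needs about log2 n rounds instead of n - 1 sequential steps; with a non-associative operator such as subtraction the two functions would in general disagree.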

Fig. 2. Collective operations shown for a group of four processes. Each row of boxes
represents data that reside in one process.

For collective operations to become a real alternative, they must demonstrate
their clear advantages over the send-recv primitives for parallel programming.
In the rest of the paper, we consider the following five challenges that should
be addressed by every new approach in parallel programming; we use these
challenges to prove the benefits of collective operations over send-recv.

Challenges for collective operations as an alternative to send-recv: 

— Simplicity: Are “collective” programs simpler and more comprehensible? 

— Programmability: Is a systematic program design process facilitated? 

— Expressiveness: Can main application classes be conveniently expressed? 

— Performance: Is the performance competitive with that using send-recv? 

— Predictability: Are program behaviour and performance more predictable? 

In the remainder of the paper, one section is devoted to each of the challenges.
Each section opens by stating a commonly held, pro-send-recv opinion, which
we somewhat polemically call a “Myth”. We proceed by discussing theoretical
and empirical results that refute the myth and conclude with the “Truth” based
on the presented facts. This “myths and truths” structure enables us to draw a
clear conclusion about the suitability of collective operations as an alternative
to send-recv.

4 The Challenge of Simplicity 

Myth : Send-recv primitives are a simple and convenient way of specifying 
communication in parallel programs. 

To expose the invalidity of the simplicity myth, we use a simple example
MPI program, Get_data1, shown in Figure 3 (top). This program is taken
almost verbatim from a popular MPI textbook [21], where it directly follows
the trivial Hello World example; thus, Get_data1 can be viewed as one of the
simplest truly parallel programs in the book. The C+MPI code in the figure
accomplishes a simple task: one process (initiator) reads an input value, a, and
broadcasts it to all other processes. To implement the broadcast more efficiently,
the processes are organized in the program as a logical binary tree, with the
initiator at the root of the tree. Communication in the program Get_data1 proceeds
along the levels of the tree, so that each non-initiator process first receives the
value and then sends it on. The main part of the code (functions Ceiling_log2,
I_send, I_receive) computes the height of the communication tree and finds the
communication partners for each process, whereas the function Get_data1 itself
organizes communication along the levels of the tree.

Despite the fact that the program in Figure 3 is even shorter than in the book 
(we broadcast one piece of data instead of three and skip almost all comments), it 
is still long and complicated, considering the simplicity of the accomplished task. 
Furthermore, the program is error-prone: even a slight imprecision in determining 
the partner processes may cause a deadlock during program execution. Note that 
the described tree communication structure is not artificial, but rather expresses 
one of the efficient patterns that are widely used in parallel programming. 

To demonstrate how collective operations simplify the program structure, 
we exploit the collective operation “broadcast”: in the MPI syntax, it is 
MPI_Bcast(). The resulting “collective” version of the program is shown in Fig- 
ure 3 (bottom). An immediate observation is that it is much shorter than the 
send-recv version, the size ratio being 6 vs. 34 lines of code. Skipping the part 
responsible for data input would result in an even more impressive saving: 3 vs. 
31 lines. 
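How much bookkeeping the send-recv version relies on can be seen in a minimal sketch that replays the stage-by-stage schedule of the tree broadcast in plain C, without MPI. The partner rules follow Figure 3; the function names (ceiling_log2, recv_partner, send_partner, simulate_tree_bcast) and the bound MAX_P are assumptions introduced for this illustration. The simulation reports failure as soon as a receive has no matching send, which in a real run would show up as a deadlock.

```c
#include <assert.h>

#define MAX_P 64   /* upper bound on simulated processes (assumption) */

/* Height of the communication tree: ceil(log2(x)) for x >= 1. */
int ceiling_log2(int x) {
    unsigned temp = (unsigned)x - 1;
    int result = 0;
    while (temp != 0) { temp >>= 1; result++; }
    return result;
}

/* In a given stage, ranks in [2^stage, 2^(stage+1)) receive. */
int recv_partner(int stage, int rank, int *source) {
    int power = 1 << stage;
    if (power <= rank && rank < 2 * power) { *source = rank - power; return 1; }
    return 0;
}

/* In a given stage, ranks below 2^stage send, if the partner exists. */
int send_partner(int stage, int rank, int p, int *dest) {
    int power = 1 << stage;
    if (rank < power) { *dest = rank + power; return *dest < p; }
    return 0;
}

/* Replays the schedule for p processes. Returns 1 if every receive is
   matched by a send from a process that already holds the value and every
   rank ends up with the value exactly once; returns 0 otherwise. */
int simulate_tree_bcast(int p, int value, int data[]) {
    int has[MAX_P] = {0};
    assert(p >= 1 && p <= MAX_P);
    has[0] = 1;
    data[0] = value;                       /* the root holds the input */
    for (int stage = 0; stage < ceiling_log2(p); stage++) {
        for (int r = 0; r < p; r++) {
            int src, dst;
            if (!recv_partner(stage, r, &src)) continue;
            if (!send_partner(stage, src, p, &dst) || dst != r) return 0;
            if (!has[src] || has[r]) return 0;
            data[r] = data[src];
            has[r] = 1;
        }
    }
    for (int r = 0; r < p; r++)
        if (!has[r]) return 0;
    return 1;
}
```

Changing a single comparison in recv_partner, for example `<=` to `<`, leaves some rank without a matching receive, and the simulation then returns 0. None of this logic is needed in the version based on MPI_Bcast.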

The complexity of programming with send-recv has many more facets than just 
long program codes: 

Firstly, the intricate communication structure induced by send-recv compli- 
cates the debugging process. Special tools are required, which provide the 
programmer with a detailed trace of program execution. This approach to 
debugging is cumbersome and has natural limitations: program behaviour 
is non-deterministic, and some errors can be detected only on particular 
machine configurations, which makes complete testing infeasible. 

Secondly, if MPI is our language of choice then we have not just one send- 
recv, but rather 8 different kinds of send and 2 different kinds of recv. Thus, 
the programmer has to choose among 16 combinations of send-recv, some of 
them with very different semantics. Of course, this makes message-passing 
programming very flexible, but even less comprehensible! 

Truth : The apparent simplicity of send-recv turns out to be the cause of large
program size and complicated communication structure that make both the
design and debugging of parallel programs difficult.

int Ceiling_log2(int x) {               /* communication tree height */
  unsigned temp = (unsigned) x - 1; int result = 0;
  while (temp != 0) {
    temp = temp >> 1;
    result = result + 1; }
  return result;
} /* Ceiling_log2 */
int I_receive(int stage, int my_rank, int *source_ptr) { /* find partner to receive from */
  int power_2_stage = 1 << stage;
  if ((power_2_stage <= my_rank) &&
      (my_rank < 2*power_2_stage)) {
    *source_ptr = my_rank - power_2_stage;
    return 1;
  } else return 0;
} /* I_receive */
int I_send(int stage, int my_rank, int p, int *dest_ptr) { /* find partner to send to */
  int power_2_stage = 1 << stage;
  if (my_rank < power_2_stage) {
    *dest_ptr = my_rank + power_2_stage;
    if (*dest_ptr >= p) return 0;
    else return 1;
  } else return 0;
} /* I_send */
void Get_data1(float *a_ptr, int my_rank, int p) {
  int source, dest, stage; MPI_Status status;
  if (my_rank == 0) {                   /* in the root process */
    printf("Enter a\n"); scanf("%f", a_ptr);
  }
  for (stage = 0; stage < Ceiling_log2(p); stage++)
    if (I_receive(stage, my_rank, &source))
      MPI_Recv(a_ptr, 1, MPI_FLOAT, source, 0, MPI_COMM_WORLD, &status);
    else if (I_send(stage, my_rank, p, &dest))
      MPI_Send(a_ptr, 1, MPI_FLOAT, dest, 0, MPI_COMM_WORLD);
} /* Get_data1 */


void Get_data2(float *a_ptr, int my_rank) {
  if (my_rank == 0) {
    printf("Enter a\n"); scanf("%f", a_ptr);
  }
  MPI_Bcast(a_ptr, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);
} /* Get_data2 */

Fig. 3. Example program with send-recv (top) and collective operation (bottom)

5 The Challenge of Programmability

Myth : The design of parallel programs is so complicated that it will probably 
always remain an ad hoc activity rather than a systematic process. 

We address here what is probably the most challenging issue in parallel
programming: To what extent are systematic design of parallel programs and
formal reasoning about them possible? Programs with collective operations can be
viewed as sequential compositions of comparatively simple parallel stages [12],
similarly to the “batch” supersteps in BSP [24]. There are two kinds of stages:
local computations in each process, and interprocess collective operations. Our
goal is to reason about how individual stages can be composed into a complete
program, with the ultimate goal of finding the best composition systematically.

We will briefly summarize some semantics-preserving transformations for 
specific compositions of collective operations (for more detail, see [11]). They 
have been formally proved using the functional Bird-Meertens formalism [5]. 
For our purposes, we present these transformations in the C+MPI notation. 

The first transformation states that, if binary operators op1 and op2 are
associative and op1 distributes over op2, then the following transformation of a
composition of scan and reduction is possible:

    MPI_Scan (op1);               Make_pair;
    MPI_Reduce (op2);     ==>     MPI_Reduce (f(op1,op2));                (1)
                                  if my_pid==ROOT then Take_first;

Here, the functions Make_pair and Take_first implement simple data
arrangements that are executed locally, i.e. without interprocessor communication. The
binary operator f(op1,op2) on the right-hand side is built using op1 and op2
from the left-hand side of the transformation. A similar transformation for two
subsequent scan operations can be found in [11].

[Diagram: execution on processes 1, 2, ..., p before and after the composition,
with the time saved marked.]

Fig. 4. Fusing two collective operations into one by a transformation like (1).



The effect of such transformations on an MPI program is that two subsequent col- 
lective operations are fused into one, with simple local computations beforehand 
and afterwards. This is illustrated in Figure 4 for a program with p processes.

Rule (1), and other similar transformation rules for collective operations
presented in the sequel, have the following important properties:

— Their correctness is proved formally as mathematical theorems. 

— They are parameterized by the occurring operators, like op1 and op2, and
are therefore applicable to a wide variety of applications.

— They are valid for all possible implementations of the involved operations. 

— They can be applied independently of the parallel target architecture. 
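Rule (1) can be checked on a small instance with the following sketch, which again simulates the processes sequentially in plain C. The text defers the definition of f(op1,op2) to [11]; the pair operator used below is therefore an assumption, namely one common construction in which each pair carries the op2-reduction of all op1-prefixes of a segment together with the op1-combination of the whole segment. Multiplication and addition are chosen as op1 and op2 because multiplication distributes over addition; all identifiers are hypothetical.

```c
#include <assert.h>

/* Example operators: op1 (multiplication) distributes over op2 (addition). */
long op1(long a, long b) { return a * b; }
long op2(long a, long b) { return a + b; }

/* Left-hand side of rule (1): scan with op1, then reduction with op2. */
long scan_then_reduce(const long *x, int n) {
    assert(n > 0);
    long prefix = x[0];   /* current element of the scan result */
    long acc = x[0];      /* running op2-reduction of the scan result */
    for (int i = 1; i < n; i++) {
        prefix = op1(prefix, x[i]);
        acc = op2(acc, prefix);
    }
    return acc;
}

/* s: op2-reduction of all op1-prefixes of a segment;
   r: op1-combination of the whole segment. */
typedef struct { long s; long r; } pair_t;

pair_t make_pair(long a) { pair_t p = { a, a }; return p; }

/* Assumed form of f(op1,op2): combines two adjacent segments. */
pair_t f(pair_t a, pair_t b) {
    pair_t c = { op2(a.s, op1(a.r, b.s)), op1(a.r, b.r) };
    return c;
}

long take_first(pair_t p) { return p.s; }

/* Right-hand side of rule (1): one reduction with f over the pairs. The
   reduction is deliberately bracketed from the right, which is legitimate
   because f is associative. */
long fused_reduce(const long *x, int n) {
    assert(n > 0);
    pair_t acc = make_pair(x[n - 1]);
    for (int i = n - 2; i >= 0; i--)
        acc = f(make_pair(x[i]), acc);
    return take_first(acc);
}
```

For the input 2, 3, 4 the scan yields 2, 6, 24, whose sum is 32; the fused version reaches the same value with a single reduction over pairs.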

Besides composition rules, there are also transformations that decompose one 
collective operation into a sequence of smaller operations. Here are two examples: 



    MPI_Bcast             ==>     MPI_Scatter;
                                  MPI_Allgather;

    MPI_Allreduce (op)    ==>     MPI_Reduce_scatter (op);
                                  MPI_Allgather;
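The first decomposition can be made concrete with a small simulation in plain C, in which buf[r] stands for the local buffer of process r. This is only a sketch under stated assumptions: the sizes P and BLOCK, the function names, and the convention that process r keeps its scattered block at block position r are all chosen for this illustration and are not taken from MPI.

```c
#include <assert.h>
#include <string.h>

#define P 4              /* number of simulated processes */
#define BLOCK 3          /* elements per block */
#define N (P * BLOCK)    /* length of the broadcast message */

/* Broadcast: every process obtains the complete message of the root. */
void bcast(int buf[P][N], int root) {
    for (int r = 0; r < P; r++)
        if (r != root)
            memcpy(buf[r], buf[root], sizeof buf[root]);
}

/* Scatter: process r obtains only block r of the root's message. */
void scatter(int buf[P][N], int root) {
    for (int r = 0; r < P; r++)
        if (r != root)
            memcpy(&buf[r][r * BLOCK], &buf[root][r * BLOCK],
                   BLOCK * sizeof(int));
}

/* Allgather: every process collects block q from process q. */
void allgather(int buf[P][N]) {
    for (int r = 0; r < P; r++)
        for (int q = 0; q < P; q++)
            if (q != r)
                memcpy(&buf[r][q * BLOCK], &buf[q][q * BLOCK],
                       BLOCK * sizeof(int));
}
```

After scatter followed by allgather, all buffers hold the same contents as after bcast. The decomposed form moves the message in P smaller pieces, which is why it can be preferable for long messages on some networks.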



Composition and decomposition rules can sometimes be applied in sequence;
here is an example of such a combined transformation:

    MPI_Scan (op1);               Make_pair;
    MPI_Allreduce (op2);  ==>     MPI_Reduce_scatter (f(op1,op2));
                                  Take_first;
                                  MPI_Allgather;

We have demonstrated elsewhere [11] how transformation rules of the kind
presented here can be exploited in the design of parallel algorithms. The idea is
to start with an intuitive, obviously correct but probably inefficient version of
an algorithm and proceed by applying semantically sound transformation rules,
until an efficient algorithm is obtained. To choose the right rule to apply at a
particular point in the design process, we need to study the impact of the design
rules on program performance. We will address this problem in Section 8.

Truth : For collective operations, sound transformation rules can be developed.
This enables a systematic program design process, in sharp contrast to the ad hoc
programming using send-recv primitives. In the next sections, we demonstrate
how the design process can be oriented towards predictable, higher performance.



6 The Challenge of Expressiveness 

Myth : Collective operations are too inflexible and cannot express many impor- 
tant applications conveniently. 

To refute this quite widely held opinion, we present in Table 1 several 
important applications, which according to the literature were implemented us- 
ing exclusively collective operations without notable performance loss as com- 
pared with their counterparts using send-recv. 







Table 1. Applications expressed using exclusively collective operations

Application                    Communication/Computation Pattern
-----------------------------  ------------------------------------------
Polynomial Multiplication      Bcast (group); Map; Reduce; Shift
Polynomial Evaluation          Bcast; Scan; Map; Reduce
Fast Fourier Transform         Iter (Map; All-to-all (group))
Molecular Simulation           Iter (Scatter; Reduce; Gather)
N-Body Simulation              Iter (All-to-all; Map)
Matrix Multiplication (Fox)    Iter (Bcast (group); Map; Shift (group))
Matrix Multiplication (3D)     Allgather (group); Map; All-to-all; Map


Here, Map stands for local computations performed in the processes without 
communication; Shift is a cyclic one-directional exchange between all processes; 
Iter denotes repetitive action; (group) means that the collective operation is 
applied not to all processes of the program but rather to an identified subset of 
processes. In MPI, the groups are specified using the concept of communicators. 

Additional strong confirmation of the expressive power of collective opera- 
tions is provided by the PLAPACK package for linear algebra [25], which has 
been implemented entirely without individual communication primitives. 

Truth : A broad class of communication patterns to be found in parallel applica- 
tions is covered by collective operations, without any notable loss of performance. 

7 The Challenge of Performance 

Myth : Programs using send-recv are, naturally, faster than their counterparts 
using exclusively collective operations. 

High performance is the first and foremost reason to exploit parallel machines. 
However, the performance of parallel programs is known to be an inexhaustible 
source of highly contradictory discussions. Examples are the continuous debates 
on superlinear speedup, as well as papers that analyze the many tricks used to 
deceive the community in terms of performance figures. They all show clearly 
how difficult it is to discuss performance matters in the parallel setting. 

The usual performance argument in favour of individual communication is 
that collective operations are themselves implemented in terms of individual 
send-recv and thus cannot be more efficient than the latter. Although this is true 
to some extent, there are two important aspects here that are often overlooked: 

1. The implementations of collective operations in terms of send-recv are
written by the implementers, who are much more familiar with the parallel
machine and its network than an application programmer can be. Recently,
hybrid algorithms have been proposed, which switch from one implementation
of a collective operation to another depending on the message size,
number of processors involved, etc. A nice example is the MagPIe library
which is geared to wide-area networks of clusters [18]. Such optimizations
are practically impossible at the user level in programs using send-recv. Some
implementations of collectives exploit machine-specific communication
commands, which are usually inaccessible to an application programmer.

2. Very often, collective operations are implemented not via send-recv, but 
rather directly in the hardware of a particular machine, which is simply 
impossible at the user level. This allows all machine resources to be fully
exploited and sometimes leads to rather unexpected results: e.g. a simple
two-directional exchange of data between two processors using send-recv on 
a Cray T3E is two times slower than a version with two broadcasts [3]. 
The explanation for this phenomenon is that the broadcast is implemented 
directly on top of the shared-memory support of the Cray T3E. 

Below, we argue against some commonly held opinions about the performance 
superiority of send-recv over collective operations, basing our arguments on 
empirical evidence from recent publications: 

It is not true that send-recv is naturally faster than collective operations. Newer 
algorithms for collective communication [22] take into account specific
characteristics of the interprocessor network, which can then be considered during
the compilation phase of the communication library. In [23], the tuning for 
a given system is achieved by conducting a series of experiments on the sys- 
tem. In both cases, a nearly optimal implementation for a particular machine 
can be achieved automatically, without sacrificing portability. This is clearly 
almost impossible in an application program written using send-recv: the 
communication structure will probably have to be re-implemented for every 
new kind of network. It is further reported in [3] that the collective operation 
MPI_Bcast on a Cray T3E always beats send-recv.

It is not true that nonblocking versions of send-recv, MPI_Isend and MPI_Irecv,
are invariably fast, owing to the overlap of communication with computation. 
As demonstrated by [3], these primitives in practice often lead to slower 
execution than the blocking version, because of the extra synchronization. 

It is not true that the flexibility of send-recv allows smarter and faster algo- 
rithms than the collective paradigm. Research has shown that many designs 
with send-recv eventually lead to the same high-level algorithms as obtained 
by the “batch” approach [15]. In fact, batch versions often run faster [16]. 

It is not true that the routing of individual messages over a network offers 
fundamental performance gains as compared with the routing for collec- 
tive operations. As shown formally by Valiant [24], the performance gap in 
this case becomes, with large probability, arbitrarily small for large problem 
sizes. A variety of theoretically interesting and practical techniques have been 
proposed - two-stage randomized routing, coalescing messages by destina- 
tion, etc. - that attempt to exploit the full bandwidth of the network, at 
least to within a constant factor. 

Truth : While absolute parallel performance achieved on a particular machine
remains a complex and fuzzy issue, there is strong evidence that send-recv does
not offer any basic advantages over collective operations in terms of performance.
There are well-documented cases where collective operations are the clear winner.
Furthermore, they offer machine-dependent, efficient implementations without
changing the application itself.

8 The Challenge of Predictability 

Myth : The behaviour and performance of parallel programs are such complicated 
issues that information can only be obtained by actually running the program 
on a particular machine configuration. 

The major advantage of collective operations is that we can not only design 
programs by means of the transformations presented in Section 5, but also esti- 
mate the impact of an applicable transformation on the program’s performance. 



Table 2. Impact of transformations on performance

Composition Rule                           Improvement if
-----------------------------------------  ----------------------
Scan_1; Reduce_2          -->  Reduce      always
Scan; Reduce              -->  Reduce      ts > m
Scan_1; Scan_2            -->  Scan        ts > 2m
Scan; Scan                -->  Scan        ts > m(tw + 4)
Bcast; Scan               -->  Comcast     always
Bcast; Scan_1; Scan_2     -->  Comcast     ts > m/2
Bcast; Scan; Scan         -->  Comcast     ts > m(tw/2 + 4)
Bcast; Reduce             -->  Local       always
Bcast; Scan_1; Reduce_2   -->  Local       always
Bcast; Scan; Reduce       -->  Local       tw + ts/m >= 1/3



Table 2 contains a list of transformations from [13], together with the condi- 
tions under which the application of a transformation improves performance. 

Note that performance predictability is usually even more difficult to achieve
than the absolute performance itself. To estimate performance, we must use
some cost model and take into account a particular implementation of collective
operations on the target machine. In the above table, a hypercube-like
implementation of collective operations is presumed, and the cost model used has the
following parameters: start-up/latency ts, transfer time tw and block size m.
These parameters are used in the conditions in the right column of the table.
The estimates were validated in experiments on a Cray T3E and a Parsytec
GCel 64 (see [11] for details).
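Conditions of this kind are cheap to evaluate at design time. The sketch below encodes four rows of Table 2 as predicates over the cost-model parameters; the struct and function names are hypothetical, and it is an assumption of this sketch that ts, tw and m are expressed in compatible units (for instance, multiples of the time of one elementary operation), since the inequalities compare them directly.

```c
#include <assert.h>

/* Cost-model parameters; names follow the text, the struct is illustrative. */
typedef struct {
    double ts;   /* start-up time (latency) */
    double tw;   /* transfer time */
    double m;    /* block size */
} cost_params;

/* Each predicate returns 1 if, for the presumed hypercube-like
   implementation, the corresponding rule of Table 2 is expected to
   improve performance, and 0 otherwise. */

/* Scan; Reduce --> Reduce */
int pays_scan_reduce(cost_params c) {
    assert(c.m > 0);
    return c.ts > c.m;
}

/* Scan_1; Scan_2 --> Scan */
int pays_scan1_scan2(cost_params c) {
    assert(c.m > 0);
    return c.ts > 2.0 * c.m;
}

/* Scan; Scan --> Scan */
int pays_scan_scan(cost_params c) {
    assert(c.m > 0);
    return c.ts > c.m * (c.tw + 4.0);
}

/* Bcast; Scan_1; Scan_2 --> Comcast */
int pays_bcast_scan1_scan2(cost_params c) {
    assert(c.m > 0);
    return c.ts > c.m / 2.0;
}
```

With ts = 100, tw = 2 and m = 30, the first, second and fourth rules are predicted to pay off, whereas fusing two scans with the same operator is not, because 100 does not exceed 30 * (2 + 4) = 180.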

Since the performance impact of a particular transformation depends on the
parameters of both the application and the machine, there are several
alternatives to choose from in a particular design. Usually, the design process can be
captured as a tree, one example of which is shown in Figure 5.



MPI_Scan (op1); MPI_Allreduce (op2);
  op1 distributes over op2?
    yes: Condition 1?
      yes: Make_pair; MPI_Reduce_scatter (op3); Take_firstofpair; MPI_Allgather;
      no:  Make_pair; MPI_Allreduce (op3); Take_firstofpair;
    no:  Condition 2?
      yes: MPI_Scan (op1); MPI_Reduce_scatter (op2); MPI_Allgather;
      no:  MPI_Scan (op1); MPI_Allreduce (op2);

Fig. 5. The tree of design alternatives.



Conditions in the figure read as follows (see [11] for how they are calculated):

    Condition 1 = ts < 2 m tw (log p - 1) / log p
    Condition 2 = ts < m (tw + 1 - (2 tw + 1) / log p)

The best design decision is obtained by checking the design conditions, which 
depend either on the problem properties, e.g. the distributivity of operators, or 
on the characteristics of the target machine (number of processors, speed of the 
channels, etc.). For example, if the distributivity condition holds, it takes us from 
the root into the left subtree in Figure 5. If the block size in an application is
small, Condition 1 yields “no”, and we thus end up with the second (from left
to right) design alternative, where op3 = f(op1,op2) according to rule (1).

Note that the conditions in the tree of alternatives may change for a different 
implementation of the involved collective operations on the same machine. 
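The walk through the tree of alternatives can be written down as a small decision function. The following is a sketch under several assumptions: Condition 1 is read as ts < 2 m tw (log p - 1) / log p and Condition 2 as ts < m (tw + 1 - (2 tw + 1) / log p), with log taken to base 2 and p a power of two; the numbering of the alternatives from left to right and, in particular, the assignment of the third and fourth alternative to the “yes” and “no” outcome of Condition 2 are assumed. All identifiers are hypothetical.

```c
#include <assert.h>

/* The four design alternatives, numbered from left to right. */
typedef enum {
    PAIR_REDUCE_SCATTER_ALLGATHER = 1, /* Make_pair; Reduce_scatter(op3); Take_first; Allgather */
    PAIR_ALLREDUCE = 2,                /* Make_pair; Allreduce(op3); Take_first */
    SCAN_REDUCE_SCATTER_ALLGATHER = 3, /* Scan(op1); Reduce_scatter(op2); Allgather */
    SCAN_ALLREDUCE = 4                 /* Scan(op1); Allreduce(op2), the original program */
} design;

/* Integer base-2 logarithm, rounded down; exact for powers of two. */
int log2_int(int p) {
    int d = 0;
    while (p > 1) { p /= 2; d++; }
    return d;
}

int condition1(double ts, double tw, double m, int p) {
    double lg = log2_int(p);
    return ts < 2.0 * m * tw * (lg - 1.0) / lg;
}

int condition2(double ts, double tw, double m, int p) {
    double lg = log2_int(p);
    return ts < m * (tw + 1.0 - (2.0 * tw + 1.0) / lg);
}

/* distributes: nonzero if op1 distributes over op2 (a problem property);
   ts, tw, m, p: machine and application parameters. */
design choose_design(int distributes, double ts, double tw, double m, int p) {
    assert(p >= 2 && m > 0);
    if (distributes)
        return condition1(ts, tw, m, p) ? PAIR_REDUCE_SCATTER_ALLGATHER
                                        : PAIR_ALLREDUCE;
    return condition2(ts, tw, m, p) ? SCAN_REDUCE_SCATTER_ALLGATHER
                                    : SCAN_ALLREDUCE;
}
```

For example, with distributive operators, ts = 100, tw = 1, p = 16 and a small block size m = 1, Condition 1 fails and the function selects the second alternative, in line with the case described in the text. Re-deriving the conditions for another implementation of the collectives only requires replacing condition1 and condition2.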

Arguably, send-recv allows a more accurate performance model than collective
operations do. Examples of quite detailed performance models, well suited
for finding new efficient implementations, are LogP and LogGP [17]. However,
these models are often overly detailed and hardly usable for an application
programmer, as demonstrated by comparison with batch-oriented models [4,14].

Truth : Collective operations contribute to the ambitious goal of predicting
program characteristics during the design process, i.e. without actually running
the program on a machine. This progress would be impossible with send-recv,
which makes the program’s behaviour much less predictable. Furthermore, the
predictability of collective operations simplifies the modelling task at application
level as compared with models like LogP.

9 Conclusion

This paper proposes - somewhat polemically - viewing send-recv primitives as 
harmful and, consequently, trying to avoid them in parallel programming. We 
have demonstrated the advantages of collective operations over send-recv in five 
major respects, which we call challenges: simplicity, expressiveness, programma- 
bility, performance and predictability. We have presented hard evidence that 
many widely held opinions about send-recv vs. collective operations are mere 
myths, which can be refuted. We strongly believe that collective operations are
a viable alternative that already works well for many parallel applications. 

The following developments are necessary or are already under way in the 
drive to broaden the use of collective operations in parallel programming: 

— More evidence should be collected about the applicability and usefulness 
of collective operations for parallel programming. In particular, we expect 
new parallel algorithms to be developed, with collective operations as the 
programming mechanism in mind. 

— It may well be the case that the current set of collective operations provided 
by MPI needs adjustment to meet the requirements of programming practice. 

— We plan to extend the results on transformation rules presented here, with 
the goal of building a complete algebra of collective operations. 

— An experimental system for program development using transformation rules 
from Section 5 is described in [1]. 

— The research into new, efficient implementations for collective operations 
is very important to support their use in practice. Our current work also 
addresses new kinds of networks, including heterogeneous ones [17]. 

— Collective operations are successfully used not only in traditional commu- 
nication libraries like MPI but also in new programming environments for 
distributed systems, including e.g. Java RMI [20]. 

— New applications of parallel computing, as well as new computational plat- 
forms such as Grids are promising candidates for collective operations. 

In addition to the many arguments in this paper, our optimism with respect to 
the “collective communication thrust” is also based on the amazing similarities 
of its development to the history of the “structured programming thrust”:

It is not easy to argue against a programming technology, like the goto or send- 
recv, that has been used for years by many programmers. However, in both 
cases an alternative is available which is also well known, so that no new 
constructs have to be learned by the users. 

A new thrust is often opposed by practitioners, while theoreticians get euphoric. 
So-called “structured compilers” were developed to automatically translate 
any program with gotos into its structured equivalent. Similarly, there is 
at least one project now under way, whose goal it is to translate programs 
with send-recv into their equivalents with collective operations [8]. While 
such research definitely contributes to a better understanding of the relation 
between different programming styles, its practical utility is uncertain, for
both the goto and send-recv. Our view is that the exclusive use of collective
operations requires new parallel algorithms and a different programming 
methodology. A direct translation of old software can often result in poorly 
structured and inefficient programs. By analogy, Dijkstra advised against a 
mechanical translation into goto-less programs [9].

One of the major objections in both cases has been that both the goto and 
send-recv are “natural” mechanisms for expressing control flow and commu- 
nication, respectively. We find this argument for send-recv elusive, for the 
following reason. The main parallel programming style, SPMD, presumes 
a “collective view” of the program. Since the number of processes is a pa- 
rameter of an MPI program, it is difficult for the programmer to think in 
terms of particular processes: one does not even know how many of them 
are involved! Individual communication with send-recv disturbs this natural, 
collective view and should therefore be viewed as unnatural. 

Other pro-goto and pro-send-recv arguments have been the feared losses in 
expressiveness and performance. In case of the goto, these arguments have 
been refuted by the progress in compilers and software engineering. In the 
case of send-recv, they will be hopefully refuted by theoretical and empirical 
evidence in favour of collective communication, as in this paper. 

We conclude by paraphrasing Dijkstra’s famous letter [9] which originally 
inspired our work. Applied to the parallel setting, it might read: 

The numerous versions of send-recv, as they stand for instance in MPI 
(non-blocking, ready, synchronous, etc.), are just too primitive; they are 
too much an invitation to make a mess of one’s parallel program. 

We strongly believe that collective operations have every potential to avoid this 
mess and to enable the design of well-structured, efficient parallel programs. 

Acknowledgements 

It is a pleasure to acknowledge the helpful comments of Christian Lengauer, 
Thilo Kielmann, Holger Bischof, Vladimir Korneev and Phil Bacon. 

References 

1. M. Aldinucci, S. Gorlatch, G. Lengauer, and S. Pelagatti. Towards parallel pro- 
gramming by transformation: The FAN skeleton framework. Parallel Algorithms 
and Applications, 16(2):87-113, 2001. 

2. V. Bala et al. GGL: a portable and tunable collective communication library for 
scalable parallel computers. In Proc. 8th Int. Conf. on Parallel Processing. 

3. M. Bernashi, G. lannello, and M. Lauria. Experimental results about MPI collec- 
tive communication operations. In Ptigh- Performance Computing and Networking, 
Lecture Notes in Gomputer Science 1593, pages 775-783, 1999. 

4. G. Bilardi, K. Herley, A. Pietracaprina, G. Pucci, and P. Spirakis. BSP vs. LogP. In 
Eighth ACM Syrnp. on Parallel Algorithm,s and Architectures, pages 25-32, 1996. 




Send-Recv Considered Harmful? 



257 



5. R. Bird. Lectures on constructive functional programming. In M. Broy, editor, 
Constructive Methods in Computing Science, NATO ASI Series F: Computer and 
Systems Sciences. Vol. 55, pages 151-216. Springer Verlag, 1988. 

6. C. Bdhm and G. Jacopini. Flow diagrams, turing machines and languages with 
only two formation rules. Comm. ACM, 9:366-371, 1966. 

7. O.-J. Dahl, E. W. Dijkstra, and C. A.R.Hoare. Structured Programming. Academic 
Press, 1975. 

8. B. Di Martino, A. Mazzeo, N. Mazzocca, and U. Villano. Restructuring parallel 
programs by transformation of point-to-point interactions into collective commu- 
nication. Available at http://www.grid.unina.it. 

9. E. W. Dijkstra. Go To statement considered harmful. Comm. ACM, 11(3):147- 
148, 1968. 

10. A. Geist et al. PVM: Pamllel Virtual Machine. Mil' Press, 1994. 

11. S. Gorlatch. Towards formally-based design of message passing programs. IEEE 
Trans, on Software Engineering, 26(3):276-288, March 2000. 

12. S. Gorlatch and C. Lengauer. Abstraction and performance in the design of parallel 
programs: overview of the SAT approach. Acta Informatica, 36(9):761-803, 2000. 

13. S. Gorlatch, C. Wedler, and C. Lengauer. Optimization rules for programming with 
collective operations. In M. Atallah, editor, Proc. IPPS/SPDP’99, pages 492-499. 
IEEE Computer Society Press, 1999. 

14. M. Goudreau, K. Lang, S. Rao, T. Suel, and T. Tsantilas. Towards efficiency and 
portablility. programming with the BSP model. In Eighth ACM Symp. on Parallel 
Algorithms and Architectures, pages 1-12, 1996. 

15. M. Goudreau and S. Rao. Single-message vs. batch communication. In M. Heath, 
A. Ranade, and R. Schreiber, editors, Algorithm,s for parallel processing, pages 
61-74. Springer- Verlag, 1999. 

16. K. Hwang and Z. Xu. Scalable Parallel Computing. McGraw Hill, 1998. 

17. T. Kielmann, H. E. Bal, and S. Gorlatch. Bandwidth-efficient collective commu- 
nication for clustered wide area systems. In Parallel and Distributed Processing 
Syrn,posium, (IPDPS 2000), pages 492-499, 2000. 

18. T. Kielmann, R. F. Hofman, H. E. Bal, A. Plaat, and R. A. Bhoedjang. Mag- 
Ple: MPFs collective communication operations for clustered wide area systems. 
In Proc. ACM SICPLAN Symposium on Principles and Practice of Parallel Pro- 
gramming (PPoPP’99), pages 131-140, 1999. 

19. Y. Kolosova, V. Korneev, V. Konstantinov, and N. Mirenkov. Yazik paralleljnykh 
algorithmov. In Vychsliteljnye Sistemy, volume 57. Nauka, 1973. In Russian. 

20. A. Nelisse, T. Kielmann, H. E. Bal, and J. Maassen. Object-based collective com- 
munication in java. In Joint ACM JavaCrande-ISCOPE 2001 Conference, 2001. 

21. P. Pacheco. Parallel Programming with MPI. Morgan Kaufmann Pubk, 1997. 

22. J.-Y. L. Park, H.-A. Choi, N. Nupairoj, and L. M. Ni. Construction of optimal 
multicast trees based on the parameterized communication model. In Proc. Int. 
Conference on Parallel Processing (ICPP), volume I, pages 180-187, 1996. 

23. S. S. Vadhiyar, G. E. Fagg, and J. Dongarra. Automatically tuned collective 
communications. In Proc. Supercomputing 2000. Dallas, TX, November 2000. 

24. L. Valiant. General purpose parallel architectures. In Handbook of Theoretical 
Computer Science, volume A, chapter 18, pages 943-971. MIT Press, 1990. 

25. R. van de Geijn. Using PLAPACK: Parallel Linear Algebra Package. Scientific and 
Engineering Computation Series. MIT Press, 1997. 




UNICORE: A Grid Computing Environment 
for Distributed and Parallel Computing 



Valentina Huber 

Central Institute for Applied Mathematics, Research Center Jülich, 
Leo-Brandt-Str., D-52428 Jülich, Germany 
v.huber@fz-juelich.de 



Abstract. UNICORE (UNiform Interface to COmputer REsources) 
provides seamless and secure access to distributed supercomputer resources. 
This paper gives an overview of its architecture, security features, user 
functions, and mechanisms for the integration of existing applications into 
UNICORE. The Car-Parrinello Molecular Dynamics (CPMD) application is used 
as an example to demonstrate the capabilities of UNICORE. 



1 Introduction 

The increasing number of applications using parallel and distributed processing, 
e.g. planetary weather forecasting or molecular dynamics research, requires 
access to remote high performance computing resources through the Internet. 
Figure 1 gives an overview of the geographical distribution of user groups 
working on the supercomputer complex of the John von Neumann Institute for 
Computing (NIC) in Jülich. 

On the other hand, one of today's main difficulties is that the interfaces 
to supercomputing resources tend to be both complicated and vendor specific. 
To solve these problems, the UNICORE project [1] was funded in 1997 by the 
German Ministry for Education and Research (BMBF). The goal of this two-year 
project and of the follow-on project UNICORE Plus [2] is to develop a seamless, 
intuitive and secure infrastructure that makes supercomputer resources 
transparently available over the network. Project partners are the German Weather 
Service (DWD), Research Center Jülich (FZJ), Computer Center of the University 
of Stuttgart (RUS), Pallas GmbH, Leibniz Computer Center, Munich (LRZ), 
Computer Center of the University Karlsruhe (RUKA), Paderborn Center for 
Parallel Computing (PC2), Konrad Zuse Center, Berlin (ZIB), and Center for 
High Performance Computing at TU Dresden (ZHR). The project is structured 
in eight sub-projects dealing with software development, quality management, 
public key infrastructure (PKI), resource modeling, application specific support, 
data management, job control flow, and meta-computing. 

The main idea is to allow users to run jobs on different platforms and at 
different locations without the need to know details of the target operating 
system, data storage techniques, or administrative policies at the supercomputer 
sites. The graphical interface enables the user to create, submit and control 
jobs from the local workstation or PC. UNICORE supports multi-system and 
multi-site applications within one job. This allows the user to choose the 
optimal system and resources for each part of a given problem. In multi-step 
jobs the user can specify the dependencies between tasks, e.g. temporal 
relations or data transfer. Currently, execution of scripts, data transfer 
directives, and CPMD tasks in batch mode are supported. 

Fig. 1. User groups of NIC Jülich. 

To create a seamless environment, jobs and resources are represented in 
abstract terms and units. The UNICORE servers translate the Abstract Job Objects 
(AJOs) into platform specific commands and options and schedule the tasks 
to honor dependencies. The autonomy of sites remains unchanged. The unique 
UNICORE user identifiers (certificates) are mapped to local account names 
(Unix logins). 

The developed software is installed at the German HPC centers for target 
systems such as CRAY T3E, T90, Fujitsu VPP, IBM SP2, and Siemens hpcLine. 

2 UNICORE System Architecture 

UNICORE lets the user prepare or modify structured jobs through a graphical 
interface, the UNICORE Client, a Java-2 application, on a local UNIX 
workstation or a Windows PC. The intuitive GUI for batch submission has the same 
look-and-feel independent of the target system and provides full information 
about resources to the user. Jobs can be submitted through the Job Preparation 
Agent (JPA) to any platform of a UNICORE Grid where the user has a local 
account, and the user can monitor and control the jobs through the Job 
Monitor/Controller (JMC). Figure 2 presents the UNICORE system components 
and their interaction. 

Fig. 2. UNICORE architecture. 

The JPA constructs an AJO with the definition of a job and contacts a 
UNICORE Gateway at a selected site. To support this selection, the JPA queries the 
availability of sites and the addresses of the corresponding gateways from the 
central UNICORE server (currently at http://www.unicore.de). 

The Gateway, a small Java application running at the target site, 
authenticates the user through the user's X.509 certificate and provides the user 
with information about the resources available at the site. It consigns the AJO 
to the appropriate Network Job Supervisor (NJS) server. 

Each target system, or cluster of systems, is controlled by one NJS; more 
than one NJS can be installed at a site. The NJS, a Java application, provides 
the resource information from the Incarnation Database (IDB) to the Gateway 
and checks the authorization of the user to use the requested resources against 
the User Database (UUDB). It substitutes the site-independent UNICORE login 
(Ulogin), which is based on a valid user certificate, with the corresponding 
local Unix login (Xlogin) on the destination system. For target sites that 
require additional security, e.g. DCE (Distributed Computing Environment), a 
Site-specific Object (SSO) of the AJO is translated into the corresponding 
procedures and commands to provide site-specific additional authentication. The 
NJS incarnates the abstract tasks destined for a local host into real batch jobs 
using the IDB and executes them through the Target System Interface (TSI) on 
the batch subsystem. Tasks to be run at a remote site are passed to a 
peer Gateway. 
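
As an illustration of the incarnation step described above, the following minimal 
Python sketch maps a certificate-based Ulogin to a local Xlogin and turns an 
abstract task into a batch command line. It is not the UNICORE implementation 
(the NJS is a Java application); the table contents, the `submit-job` command, 
its options, and all function names are hypothetical and chosen only for this 
example. 

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the UUDB and the IDB; the real formats are not
# described in the text.
UUDB = {"CN=Jane Doe,O=Example": "jdoe"}  # certificate subject -> local Unix login
IDB = {
    "submit": "submit-job",  # placeholder, not a real batch command
    "options": {"processors": "--procs={}", "cpu_time": "--cpu-seconds={}"},
}


@dataclass
class AbstractTask:
    name: str
    script: str
    resources: dict  # abstract resource name -> requested amount


def map_login(ulogin, uudb):
    """Replace the site-independent Ulogin by the local Xlogin, or refuse."""
    if ulogin not in uudb:
        raise PermissionError(f"no local account for {ulogin}")
    return uudb[ulogin]


def incarnate(task, ulogin, uudb=UUDB, idb=IDB):
    """Translate an abstract task into a site-specific batch request."""
    xlogin = map_login(ulogin, uudb)
    # Each abstract resource is rendered with the site's own option syntax.
    options = [idb["options"][name].format(amount)
               for name, amount in sorted(task.resources.items())]
    command = " ".join([idb["submit"], *options, f"{task.name}.sh"])
    return {"xlogin": xlogin, "command": command, "script": task.script}


if __name__ == "__main__":
    task = AbstractTask("si8-optimize", "echo run", {"processors": 8, "cpu_time": 3600})
    print(incarnate(task, "CN=Jane Doe,O=Example"))
```

Keeping the site-specific option syntax in a table separate from the task 
definition is what lets the same abstract job be resubmitted to a different 
target system without changes. 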

The TSI is a daemon, a small Perl script running on the target system, 
which submits the jobs to the local batch subsystem, e.g. NQS, and returns 
implicit output (stdout, stderr, log files) from the jobs to the NJS, where it 
is retained for access by the user. Any temporary files created while jobs are 
running are automatically deleted. The Export files (see "Preparation of Jobs") 
remain available at the location specified by the user on the target system or 
are transferred to the local workstation or PC. 

A low-level protocol layer between components, called the UNICORE 
Protocol Layer (UPL), provides authentication, SSL communication and transfer of 
data as byte-streams. The security is based on the Java implementations of SSL 
and the Java Cryptography Extensions [3] of the Institute for Applied 
Information Processing and Communications [4] at the Graz University of Technology. 
A high-level layer (AJO class library) contains classes to define UNICORE jobs, 
tasks, status and resource requests. 

The authentication of users and components (Gateways and NJSes) is based 
on certificates issued by a UNICORE Certification Authority (CA). It is located 
at LRZ in Munich and meets the regulations defined by the DFN-PCA (German 
Research Network - Policy Certification Authority) [5]. The partner centers run 
a Registration Authority (RA). 

3 Application Specific GUIs 

The general basis for the integration of applications into UNICORE is the usage 
of the ExecuteScript Task. The ExecuteScript task contains the definition of a 
script, the sequence of commands to be executed on the target system, and the 
list of input and output files for the application. In addition the user can select 
several execution contexts, e.g. MPI-1, PVM, Debug, Profile, C, Fortran, etc. 
These contexts are predefined execution environments and are used for example 
to run parallel programs using MPI. The UNICORE Client provides the user 
with information about the available resources for each task and their limits, e.g. 
the minimum and the maximum number of processors on the selected machine. 
The Transfer Task is used for the transfer of files from one site to another one. 

Furthermore, users have the possibility to integrate new or already 
existing application specific GUIs as plug-ins into the UNICORE Client. 
Plug-ins are modules that are specifically written to extend the capabilities of 
the UNICORE Client. They use the standard functions of the UNICORE Client 
for authentication, security, data transfer, submission and monitoring of jobs, 
and provide additional support for the applications, e.g. the specification of 
libraries, the preparation of configuration files, etc. 

Each application plug-in consists of some wrapper classes. The plugin class 
extends the Job Preparation Menu of the UNICORE Client with the options to 
add an application task to the job tree and provides the UNICORE Client with 
the information about the other plug-in classes. The TaskContainer class 
constructs the AJO for the particular application task and the JPAPanel class 
presents the application specific GUI. The system plug-ins are located in the 
unicore-client/plugin directory. In addition the user can specify a plug-in 
directory for his or her own plug-ins in the "User Defaults" dialog. The UNICORE 
Client scans these directories for plug-ins at start-up, loads them, and displays 
the application specific GUI in the JPA when the user selects the icon 
representing the corresponding application task. 

Fig. 3. GUI for the CPMD task. 

We selected the widely used Car-Parrinello Molecular Dynamics code [9] as 
the first application to be integrated into UNICORE. CPMD is an ab initio 
Electronic Structure and Molecular Dynamics program; since 1995 its development 
has continued at the Max-Planck-Institut für Festkörperforschung in Stuttgart 
[11]. This application uses a large amount of CPU time and disk space and is 
an ideal candidate for a Grid application. Currently, multi-processor versions 
for IBM RISC and Cray PVP systems and parallel versions for IBM SP2 and 
Cray T3E are available. 

The developed CPMD interface provides the users with an intuitive way to 
specify the full set of configuration parameters (specification of the input and 
output data sets, pseudopotentials, etc.) for a CPMD simulation. 

4 Preparation of Jobs 

Figure 3 shows the input panel for one CPMD task, in this case a molecular 
dynamics run. It is divided into four areas: Properties, the configuration area for 
the CPMD calculation, data Imports and data Exports. 

The Properties area contains global settings like the task name, the task's 
resource requirements and the task's priority. The resource description includes 
the number of processors, the maximum CPU time, the amount of memory, and 
the required permanent and temporary disk space. The JPA knows the 
minimum and maximum values for all resources of the execution system 
where the task is to be run, and incorrect values are shown in red. 
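
The range check that the JPA performs on resource requests can be sketched in a 
few lines of Python. This is only an illustration of the idea: the resource 
names and the limits below are invented for the example and are not taken from 
any UNICORE site. 

```python
def check_resources(requested, limits):
    """Return the names of requested resources that fall outside [min, max]."""
    invalid = []
    for name, value in requested.items():
        low, high = limits[name]
        if not (low <= value <= high):
            invalid.append(name)
    return invalid


if __name__ == "__main__":
    # Hypothetical limits of one execution system: name -> (minimum, maximum).
    limits = {"processors": (1, 512), "cpu_time_s": (1, 86400), "memory_mb": (1, 128)}
    requested = {"processors": 1024, "cpu_time_s": 3600, "memory_mb": 64}
    # A GUI would highlight the returned fields, e.g. by showing them in red.
    print(check_resources(requested, limits))
```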

The configuration area contains the application specific information. It in- 
cludes the specification of the input file, required for the CPMD program [10]. 
The button Generate brings up the tool CPMD Wizard, developed at Research 
Center Jülich, which generates the CPMD input data automatically. Experienced 
users may use the data from existing jobs, stored on the local computer. The con- 
figuration data may be edited directly or through the Wizard. It is also possible 
to save data as a text file on the local disk. 

For all atomic species, which will be used in the CPMD calculation, the path 
to the pseudopotential library has to be specified. The local pseudopotential files 
will be automatically transferred to the target system. Alternatively, the user 
can specify the remote directory for the pseudopotentials. If this field is empty, 
then the default library on the destination system will be used. 

The Imports area describes the set of input files for the CPMD calculation, 
e.g. a restart file to reuse the simulation results from a previous step. The input 
files may reside on the local disk or on the target system. Local files marked 
@LOCAL are automatically transferred to the target system and remote files 
will be imported to the job directory. 

The Exports area controls the disposition of the result files to be saved after 
the job completion. In the example some of the output files will be stored on 
the target system and others, marked @LOCAL, will be transferred to the local 
system and can be visualized there. 

Before the CPMD job can be submitted to a particular target system, the 
interface automatically checks the correctness of the job. Prepared jobs can be 
stored to be reused in the future. 

UNICORE has all the functions to group CPMD tasks and other tasks into 
jobs. Each task of a job may execute on a different target host of the UNICORE 
Grid. The job can be resubmitted to a different system by changing the target 
system. UNICORE controls the execution sequence, honoring dependencies and 
transfers data between hosts automatically. 

Figure 4 shows an example of a CPMD job consisting of two steps: the 
si8-optimize task for the wavefunction optimization of a cluster of 8 silicon atoms 
and the si8-mdrun task for a molecular dynamics run. Both tasks will be executed 
on the same system, the T3E in Jülich. The left hand side of the JPA shows the 
hierarchical job structure. The green color of the icons indicates that the job 
is Ready for submission. The second task will run only after the first one has 
completed. It uses the output files from the si8-optimize task to reuse the 
results of the wavefunction optimization. This dependency is shown on the right 
hand side and represents a temporal relation between the tasks. 
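
Honoring such temporal relations amounts to ordering the tasks of a job along 
their dependency graph. The sketch below shows one way to do this with the 
Python standard library (`graphlib`, Python 3.9 or later); it is an 
illustration only and makes no claim about how the UNICORE servers schedule 
tasks. The job dictionaries are assumptions modeled on the two-step example 
above. 

```python
from graphlib import TopologicalSorter


def execution_order(dependencies):
    """dependencies maps each task to the set of tasks that must finish first."""
    return list(TopologicalSorter(dependencies).static_order())


def ready_batches(dependencies):
    """Group tasks into batches; tasks in one batch do not depend on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose predecessors are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    job = {"si8-optimize": set(), "si8-mdrun": {"si8-optimize"}}
    print(execution_order(job))
    print(ready_batches({"a": set(), "b": set(), "c": {"a", "b"}}))
```

Tasks that end up in the same batch could be dispatched to different target 
systems at the same time, while a task in a later batch waits for its 
predecessors. 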

5 Monitoring of Jobs 

The user can monitor and control the submitted jobs using the job monitor part 
(JMC) of the UNICORE Client. The JMC displays the list of all jobs the user 
has submitted to a particular system. A job is initially represented by an icon 
that can be expanded to show its hierarchical structure. The status of jobs or 
parts of jobs is indicated by colors: green - completed successfully, blue - 
queued, yellow - running, red - completed unsuccessfully, etc. It is possible to 
delete a job that has not begun execution, or to terminate running jobs. 

Fig. 4. CPMD job consisting of two tasks and the dependency between them. 

After a job or a part of a job has finished, the user can retrieve its output. A 
completed job remains in the list of jobs until the user removes it. 

Figure 5 presents the status of the jobs submitted to the T3E system in 
Jülich. The right hand side displays the summary of standard output and standard 
error from the two steps si8-optimize and si8-mdrun of the CPMDsi8 job. 

6 Outlook 

The technique of allowing independent tasks to execute simultaneously on 
different machines and independent child AJOs to execute simultaneously at 
different sites, supported by UNICORE, provides an alternative to asynchronous 
parallelism. In addition, one of the sub-projects aims to extend the capability 
of UNICORE to allow metacomputing in the usual sense, which typically 
requires support for synchronous message passing. 

The technique used for the CPMD integration is extensible to numerous other 
applications. We plan to develop the interfaces for MSC-NASTRAN, FLUENT 
and STAR-CD applications. These interfaces are going to be integrated into the 
UNICORE Client for seamless submission and control of jobs. In the future it is 
planned to build a generic interface to allow easier integration of applications. 

The first production-ready version of the UNICORE system has been 
deployed for operational use at the German HPC centers. 

Fig. 5. The Job monitor displays the status of the jobs submitted to a particular 
system. 
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Abstract. The efficient solution of many large-scale scientific calculations 
depends on unstructured mesh strategies. For example, problems where the 
solution changes rapidly in small regions of the domain require an adaptive mesh 
strategy. In this paper we discuss the main algorithmic issues to be addressed with 
an integrated approach to solving these problems on massively parallel 
architectures. We review new parallel algorithms to solve two significant problems 
that arise in this context: mesh refinement and the linear solver. A procedure 
to support parallel refinement and redistribution of two dimensional 
unstructured finite element meshes on distributed memory computers is 
presented. The parallelization of the solver is based on a parallel conjugate 
gradient method using domain decomposition. The error indicator and the 
resulting refinement parameters are computed in parallel. 



1 Introduction 

Unstructured mesh strategies have proven to be very successful in reducing the 
computation and storage requirements for many scientific and engineering calculations. 
Massively parallel computers offer a cost-effective tool for solving such problems. However, 
many difficult algorithmic and implementation issues must be addressed to make effective 
use of this resource. In this paper, we review the major aspects of an unstructured mesh 
strategy and present an integrated approach to deal with these aspects on distributed 
memory machines. We also present computational results from a preliminary 
implementation of this approach. The irregular and evolving behavior of the 
computational load in adaptive strategies on complex domains becomes problematic 
when parallel distributed-memory machine implementations are considered. Complete 
parallelizations of these methods necessitate additional and difficult stages of 
partitioning, parallel refinement and the redistribution of the refined mesh. Many 
heuristics have been devised to partition the initial unstructured mesh and hence 
minimize the load imbalance and interprocessor communication among processors. 
The redistribution of the refined mesh can also be done by parallelizing similar 
partitioning heuristics. 
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Adaptive finite element methods, driven by automatic estimation and control of 
errors have gained importance recently due to their ability to offer reliable solutions 
to partial differential equations. Starting with a coarse initial mesh, the 
following steps 

the calculation of an approximate solution; 

the estimation of the distribution of the discretization error over the domain 
under consideration; 

the generation of an improved mesh by a complete remeshing of the domain 
(h-version of adaptivity); 

if the current partitioning is adequately load balanced, control is passed back 
to the solver; 

otherwise, a repartitioning procedure is invoked to divide the mesh into 
subdomains; 

remapping the data; 

are executed repeatedly until the global error is within a desired tolerance. We 
propose an algorithm that combines adaptivity and parallel computation, based on 
automatic domain decomposition. 



2 Parallel Adaptive Mesh Refinement 

In this paper we consider adaptive refinement of triangular meshes by simple 
bisection. The longest side bisection of a triangle is the partition of the triangle 
by the segment joining the midpoint of its longest edge to the opposite vertex. An 
incompatible edge is an edge common to a pair of triangles that has been divided 
in only one of them [1]. Other possible approaches, and more details of the 
following algorithms, are given in [2,3]. In this paper, we present a new parallel 
algorithm for the adaptive construction of nonuniform meshes. 
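
To make the definition concrete, the following short Python sketch bisects a 
single triangle along its longest side. It is a serial illustration only; the 
function name and the representation of a triangle as three (x, y) tuples are 
assumptions made for this example. 

```python
import math


def bisect_longest_side(tri):
    """Split a triangle, given as three (x, y) vertices, into two triangles.

    The longest edge is cut at its midpoint and the midpoint is joined to
    the opposite vertex. The vertex order (orientation) is preserved.
    """
    # Edge i runs from vertex i to vertex (i + 1) % 3.
    longest = max(range(3), key=lambda i: math.dist(tri[i], tri[(i + 1) % 3]))
    a = tri[longest]
    b = tri[(longest + 1) % 3]
    c = tri[(longest + 2) % 3]  # vertex opposite the longest edge
    mid = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
    return [(a, mid, c), (mid, b, c)]


def area(tri):
    """Signed area; positive for counter-clockwise vertex order."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    return 0.5 * ((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))


if __name__ == "__main__":
    for child in bisect_longest_side(((0, 0), (4, 0), (0, 3))):
        print(child, area(child))
```

The neighbor across the bisected edge now has a hanging node on that edge, 
which is exactly the situation the second refinement step has to repair. 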

Parallel mesh refinement is the refinement of a distributed mesh. The initial 
mesh is partitioned and distributed among the processors. A part of the mesh will 
be called a submesh. The partition is produced so that the intersection of two 
submeshes is either a shared vertex or shared edges. Refinement is made by longest 
side bisection. The refinement process consists of two steps: 

Step 1. Divide all triangles marked for refinement. 

Step 2. Divide all triangles with incompatible edges. 

All triangles with incompatible edges must be found for the second step. In the 
single processor case all search information is stored on one processor. In the 
case of many processors an incompatible edge may be a shared edge. If an 
incompatible edge is a shared edge, then the processor on which the divided edge 
lies sends a message to the adjacent processor. The adjacent processor receives 
the message, but it cannot find this edge using the local edge number. A unique 
global numbering of mesh objects is required in this case. We use unique numbers 
produced from the edge coordinates [4]. The unique numbers are chosen from a 
large range ( 0..2' ) and used as search keys. The search is based on hashing 
with open addressing. Collisions in the hash table are resolved by linear 
probing. The unique edge numbers are stored on every processor in an edge hash 
table. The numbers of the adjacent triangles and the processors that own these 
triangles are stored in two tables. These tables are aligned with the edge hash 
table. We distinguish internal edges, shared edges and boundary edges, and find 
adjacent triangles and submeshes by using this information. 
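
The edge lookup can be illustrated with a small Python sketch of a hash table 
with open addressing and linear probing. The way the key is derived from the 
edge coordinates here (a quantized midpoint packed into one integer) is an 
assumption made for the example; the text does not give the actual formula, 
which is described in [4]. Class and function names are likewise hypothetical. 

```python
def edge_key(p, q, scale=1000):
    """Direction-independent integer key built from the edge midpoint.

    Assumes non-negative coordinates smaller than 2**32 / scale.
    """
    mx = round((p[0] + q[0]) * scale / 2)
    my = round((p[1] + q[1]) * scale / 2)
    return mx * 2**32 + my


class EdgeTable:
    """Fixed-size hash table with open addressing and linear probing."""

    def __init__(self, capacity=64):
        self.keys = [None] * capacity
        self.values = [None] * capacity
        self.count = 0

    def _probe(self, key):
        i = key % len(self.keys)
        # Step to the next slot until the key or a free slot is found.
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) % len(self.keys)
        return i

    def put(self, key, value):
        i = self._probe(key)
        if self.keys[i] is None:
            # One slot always stays free so that probing terminates.
            if self.count == len(self.keys) - 1:
                raise OverflowError("edge table is full")
            self.keys[i] = key
            self.count += 1
        self.values[i] = value

    def get(self, key, default=None):
        i = self._probe(key)
        return self.values[i] if self.keys[i] == key else default


if __name__ == "__main__":
    table = EdgeTable()
    # Value: (adjacent triangle numbers, owning processors), as in the text.
    table.put(edge_key((0.0, 0.0), (1.0, 0.0)), ((3, 7), (0, 1)))
    print(table.get(edge_key((1.0, 0.0), (0.0, 0.0))))
```

Because both processors derive the same key from the same coordinates, a 
message that carries only the key is enough for the receiver to locate its 
local copy of the shared edge. 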

Every processor executes the following algorithm. 

Step 1. Divide the triangles marked for refinement in this submesh. 

Step 2. While the number of incompatible edges in the whole mesh is greater than 0: 

Step 2.1 While the number of incompatible edges in this submesh is greater than 0: 

divide the triangles with incompatible edges in this submesh. 

Step 2.2 Determine the number of divided edges. 

Step 2.3 Exchange the numbers of the shared edges that have been divided 
in adjacent submeshes. 

Step 2.4 Set the number of incompatible edges to the number of shared edges that 
have been divided in adjacent submeshes. 

Let the incompatible edges be the shared edges that have been divided 
in adjacent submeshes. 

Step 2.5 Determine the number of incompatible edges in the whole mesh. 

Local mesh refinement reduces the efficiency of the parallel assembly and solution 
of the FE system of equations because of load imbalance. After parallel mesh 
refinement we therefore apply dynamic load balancing. 



3 Dynamic Load Balancing 

The dual graph representation of the initial computational mesh is one of the key 
features of this work. Parallel implementation of adaptive solvers requires a 
partitioning of the computational mesh such that each element belongs to a unique 
partition. Communication is required across faces that are shared by adjacent elements 
residing on different processors. Hence for the purposes of partitioning, we consider 
the dual of the original computational mesh. The elements of the computational mesh 
are the vertices of the dual graph. An edge exists between two dual graph vertices if 
the corresponding elements share a face. A graph partitioning of the dual thus yields 
an assignment of triangles to processors. The finite element mesh partitioning library 
ParMetis [5] has been used to obtain the element-balanced partitions. ParMetis is an 
MPI-based parallel library. We used an algorithm for refining a k-way partitioning 
that is a generalization of the Kernighan-Lin/Fiduccia-Mattheyses algorithm 
(RefineKway). For partitioned meshes that are highly imbalanced in localized areas, a 
diffusion-based load balancing scheme (LDiffusion) is used to minimize the difference 
between the original partitioning and the final repartitioning by making incremental 
changes in the partitioning to restore balance. Another scheme balances the load by 
computing an entirely different p-way partition, and then intelligently mapping the 
new partition to the old one such that the redistribution cost is minimized (Remap, 
MLRemap). Load balancing is required before the computation after every mesh 
refinement step. The performance results of the various mesh partitioning algorithms 
are summarized in Figs. 3-6. 
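
The construction of the dual graph described in this section can be sketched as 
follows. The Python code builds the element adjacency of a triangular mesh and 
converts it to compressed adjacency arrays of the kind graph partitioners 
commonly accept (named `xadj` and `adjncy` here after the METIS convention). It 
is a serial illustration under the assumption that triangles are given as 
triples of global vertex numbers; it does not call ParMetis. 

```python
from collections import defaultdict


def dual_graph(triangles):
    """Map each triangle index to the set of triangles sharing an edge with it."""
    edge_owners = defaultdict(list)
    for t, (a, b, c) in enumerate(triangles):
        for u, v in ((a, b), (b, c), (c, a)):
            edge_owners[frozenset((u, v))].append(t)
    adjacency = {t: set() for t in range(len(triangles))}
    for owners in edge_owners.values():
        if len(owners) == 2:  # interior edge: the two elements are neighbors
            s, t = owners
            adjacency[s].add(t)
            adjacency[t].add(s)
    return adjacency


def to_csr(adjacency):
    """Flatten the adjacency sets into (xadj, adjncy) index arrays."""
    xadj, adjncy = [0], []
    for t in range(len(adjacency)):
        adjncy.extend(sorted(adjacency[t]))
        xadj.append(len(adjncy))
    return xadj, adjncy


if __name__ == "__main__":
    mesh = [(0, 1, 2), (1, 3, 2), (3, 4, 2)]
    graph = dual_graph(mesh)
    print(graph)
    print(to_csr(graph))
```

In two dimensions the "faces" shared by elements are triangle edges, so a dual 
vertex has at most three neighbors. 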
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4 Error Estimation 



We have tested our algorithm using the linear elasticity equations on a variety of 
geometries. For error estimation, smoothing of the stress field is used. The 
basic idea of such error estimators is to substitute the exact stress field $\sigma$, 
which is generally unknown, by the field $\sigma^*$, obtained by means of recovery 
procedures. Usually, the conjugate approximation method [6] can be used, which 
consists in solving the following linear systems of equations 

\[ G \sigma^*_k = f_k, \quad k = 1, 2, 3 \qquad (1) \]

where three systems of equations exist in (1), one for each stress component k. The 
coefficients of G are defined by 

\[ G_{ij} = \int_{\Omega} \varphi_i \varphi_j \, d\Omega \qquad (2) \]

and are thus similar to those of a consistent mass matrix of the structure with unit 
mass density; $\varphi_i$ denotes the shape function of node i, $\sigma^*_k$ is the 
vector of the k-th component of nodal stresses, and $\sigma^*_{kN}$ is the k-th 
component of stress at node N. The coefficients of $f_k$ are defined by 

\[ f_{ki} = \int_{\Omega} \varphi_i \, \sigma_k \, d\Omega \qquad (3) \]

Therefore, the expression for computing the approximate (estimated) relative error 
distribution can be expressed as 

The contribution of all the elements in the mesh is given by 

\[ \|e\|^2 = \sum_{i=1}^{M} \|e\|_i^2 \qquad (5) \]

where M is the total number of elements. 

The relative percentage error in the energy norm for the whole domain can be 
obtained as 

\[ \eta = \frac{\|e\|}{\|u\|} \times 100\% \qquad (6) \]

where $\|u\|$ is given by 

(7) 

A criterion for an "optimal" mesh consists of requiring that the energy norm error be 
equidistributed among elements because it leads to meshes with high convergence. 
Thus, for each element 

\[ \|\bar{e}\|_i = \bar{\eta} \left[ \frac{\|u\|^2 + \|e\|^2}{M} \right]^{1/2} \qquad (8) \]

By defining the ratio 

\[ \xi_i = \frac{\|e\|_i}{\|\bar{e}\|_i} \qquad (9) \]

it is obvious that refinement is needed if 

\[ \xi_i > 1.0 \qquad (10) \]

The selected elements are refined according to the algorithm described in Section 2. 
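
The selection of elements can be sketched in Python as follows. The sketch 
assumes that the permissible element error has the equidistribution form 
suggested by (8), i.e. a target relative error times the square root of 
(||u||^2 + ||e||^2) / M, and that an element is marked when its ratio exceeds 
1.0. The function name, the argument names, and the sample numbers are 
hypothetical. 

```python
import math


def mark_for_refinement(element_errors, solution_norm, target):
    """Return the ratios xi_i and the indices of elements with xi_i > 1.

    element_errors: energy-norm error estimate ||e||_i of every element
    solution_norm:  energy norm ||u|| of the solution
    target:         permissible relative error as a fraction (0.04 for 4%)
    """
    m = len(element_errors)
    total_sq = sum(e * e for e in element_errors)  # ||e||^2 summed over elements
    permissible = target * math.sqrt((solution_norm ** 2 + total_sq) / m)
    ratios = [e / permissible for e in element_errors]
    marked = [i for i, xi in enumerate(ratios) if xi > 1.0]
    return ratios, marked


if __name__ == "__main__":
    ratios, marked = mark_for_refinement([3.0, 4.0, 0.0, 0.0], 12.0, 0.5)
    print(ratios, marked)
```

An element can be marked even when the global error is already below the 
target, because the criterion asks for an equal share of the error per element 
rather than a small total. 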



5 Parallel Conjugate Gradient Method 



Usually in the finite element method it is assumed that the mesh and the system of 
linear equations are assembled so that each node belongs to several elements. Let us 
assume the reverse: the set of elements and element matrices is torn apart. Let $A$ be 
any global matrix and $A^i$ an element contribution to $A$. We can write 

$A = \sum_{i=1}^{M} C_i A^i C_i^T$, where $C_i$ is called a Boolean matrix and has the property 



eR"""', D =C CCC , D, eR""" 

where - diagonal matrix containing for each node of a number of elements, with 
which belongs to this node. Diagonal matrix D e R'"’""' is composed from D^' blocks. 
Block-diagonal matrix AgR ” , m= Mxn, is composed from A' blocks. Matrix 
A' and vectors T' are incomplete, i.e. coefficients associated with nodes of element 
(subdomain) i and j do not contain the contributions form adjacent element 
(subdomain). We have termed the exchange of coefficients required to create 
complete vectors operation 



q, 

i = \ 



( 11 ) 



This operation represents the assembly and subsequent extraction of element 
vectors. Operation (11) is implemented through an exchange of data among the 
elements that share a global node. Element contributions may be assembled into 
subdomains. Each processor holds the portion of the global vectors (e.g., p, u, r, 
etc.) required by the solution algorithm that is associated with the elements 
assigned to that processor. 

The algorithm for solving the system on a parallel computer can be implemented 
as indicated by the following code: 

- A6R'"'"'\p,r,n,q,qGR"‘' , y,,p,,p,a,,,C5,e,T,,,C5,,e„p,, gR'-; 
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= 0 , r„ = Cf , p, = (r„ , Dr„) , p = 10 * ; 
For k = 0,1,2,...; 




= Ap, 


( 11 ) 


q, =CXC"q, 


( 12 ) 


=(q<,Dq,) 


( 13 ) 


9, =(q<,DL,) 


( 14 ) 


T, =(P,,Dq,) 


( 15 ) 


/' /' P 


( 16 ) 


a, = - — 
r„ 


( 17 ) 


u,, = u, - «,p. 


( 18 ) 


F,., =r, +«/,q/, 


( 19 ) 


= Pi, + 2a,,0 + a;o) 


( 20 ) 


if <e exit; 

Po 






( 21 ) 


P„ 




P«., =r« -Ap« 


( 22 ) 


The element matrix and vector calculations can obviously be made in parallel. 
Note that the matrix-vector products can be carried out at the element level within 
each subdomain. Let us further partition the matrix-vector product to distinguish nodes 
in the interior from those on the interface. The matrix-vector product separates into 
two parts. Provided the interior submatrix-vector product is sufficiently large, this 
splitting may permit complete overlap of communication and computation. This strategy 
can be implemented efficiently using the MPI procedures MPI_Isend and MPI_Irecv to 
perform non-blocking communications. 
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As against reference algorithm [7] a inner product of vectors of discrepancies on 
iteration was evaluated through three dot products (14) - (16) on previous step; 

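\[ (r_{k+1}, r_{k+1}) = \sum_{i=1}^{m} r_{k+1,i}^2 = \sum_{i=1}^{m} (r_{k,i} + \alpha_k q_{k,i})^2 \]

\[ = \sum_{i=1}^{m} r_{k,i}^2 + 2\alpha_k \sum_{i=1}^{m} r_{k,i} q_{k,i} + \alpha_k^2 \sum_{i=1}^{m} q_{k,i}^2 
   = (r_k, r_k) + 2\alpha_k (q_k, r_k) + \alpha_k^2 (q_k, q_k). \]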



The three inner products in (14) - (16) can now be calculated using only a single 
collective communication operation to perform the three global summations. 

Non-blocking communication and the single collective operation have reduced the 
execution time by 15-25%. 
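
The effect of the identity above can be illustrated with a small serial Python 
sketch of a conjugate gradient iteration in which the new residual norm is 
obtained from three dot products that are accumulated together, so that a 
parallel code would need only one global summation per iteration. This is a 
plain, unpreconditioned variant written for illustration, not the listing of 
this paper: it uses the common sign convention r_{k+1} = r_k - alpha_k q_k 
(hence the minus sign in the recurrence), a dense matrix stored as a list of 
rows, and hypothetical function names. 

```python
def dot3(q, r, p):
    """Accumulate (q,q), (q,r) and (p,q) in one pass.

    In a distributed code each process would do this on its own part and a
    single reduction of the three partial sums would follow.
    """
    gamma = theta = tau = 0.0
    for qi, ri, pi in zip(q, r, p):
        gamma += qi * qi
        theta += qi * ri
        tau += pi * qi
    return gamma, theta, tau


def cg_fused(A, f, eps=1e-12, max_iter=1000):
    """Solve A u = f for a symmetric positive definite A (list of rows)."""
    u = [0.0] * len(f)
    r = list(f)
    p = list(f)
    rho = sum(x * x for x in r)  # (r_0, r_0)
    rho0 = rho
    if rho0 == 0.0:
        return u
    for _ in range(max_iter):
        q = [sum(a * x for a, x in zip(row, p)) for row in A]  # q = A p
        gamma, theta, tau = dot3(q, r, p)
        alpha = rho / tau
        u = [ui + alpha * pi for ui, pi in zip(u, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        # (r_{k+1}, r_{k+1}) from the three products, no extra reduction.
        rho_new = rho - 2.0 * alpha * theta + alpha * alpha * gamma
        if rho_new / rho0 < eps:
            break
        beta = rho_new / rho
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rho = rho_new
    return u


if __name__ == "__main__":
    print(cg_fused([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```

The recurrence for the residual norm is subject to rounding error near 
convergence, which is why the sketch stops on a relative tolerance rather than 
iterating until the recurred value is exactly zero. 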

We used this algorithm for solving system (1) with k right-hand sides. In this case 
the dimension of all vectors is increased k times. The iterations are completed when 
the exit condition is fulfilled for all processes. This organization of the 
computation decreases the number of communications. 



6 Results 

Consider now a crack problem in linear elasticity. The parameters of the structure are 
E = 1.0, ν = 0.3. Due to the symmetry, only one half of the structure is analyzed, 
starting with a uniform initial mesh. The mesh is adaptively refined according to the 
energy norm until the local error estimate for each triangle is less than a specified 
tolerance. Fourteen refinement steps are carried out using the error estimators 
(1)-(10). The computational experiments are performed with linear elements and an 
error tolerance η = 4%. The coarse mesh and initial partitioning are shown in Fig. 1. 
In Fig. 2 the refined mesh is decomposed with the MLRemap algorithm into eight 
subdomains. The linear systems for the stiffness and error estimation problems are 
solved using the parallel conjugate gradient method (12) - (23). Although the primary 
focus of this paper is the adaptive refinement algorithm, it is important to examine 
the performance of the algorithm both individually and in the context of the complete 
problem solution. Thus, we have included the matrix assembly, linear solution, error 
estimation, mesh refinement, mesh partitioning and remapping in our experimental 
results (Fig. 3). Without load balancing the algorithm ran more than twice as slowly. 
We can draw three major conclusions from the plots. First, the refinement time is 
comparable to the partitioning time. Second, the time to solve the stiffness system 
and system (1) dominates the time to refine the mesh. Finally, the total adaption 
time is always less than several percent of the total execution time. Figs. 4 and 5 
show the quality of the partitioning obtained at each step. Note that the number of 
common edges (the length of the subdomain boundaries) is approximately identical for 
all algorithms. Fig. 6 shows the number of elements moved after each step. 




Parallel Adaptive Mesh Refinement with Load Balancing for Finite Element Method 




Fig. 1. Initial mesh partition (51 triangles, 37 nodes, η = 33.48%). 




Fig. 2. Mesh after 14 refinement step ( 17224 triangles, 8744 nodes, T| = 3,89% ), 



Our experiments were run on up to 8 nodes of the Parsytec CC-8 machine at the
Institute of Applied Mechanics. The machine is equipped with PowerPC 604 thin
nodes (133 MHz) with at least 64 MB of memory. The top-level message-passing calls
are implemented through MPI [8].

[Figure 3: execution time per task (legend: Refinement, Load Balancing, Remapping) for the algorithms LDiffusion, MLRemap, Remap, and RefineKway]

Fig. 3. Task execution time for several load-balancing algorithms.

[Figure 4: number of common edges; surviving legend entries: Remap, RefineKway]

Fig. 4. Number of common edges Ec.



[Figure 5: imbalance and average number of elements; series: LDiffusion, MLRemap, Remap, RefineKway, Middle]

Fig. 5. Imbalance $\max(M_i) - \min(M_i)$ and average number of elements.

Fig. 6. Number of moved elements.



Abstract. A specification for structural synthesis of programs (SSP) contains
information needed for introducing concurrency into a synthesized program.
We explain how this can be used in a multithreaded computing environment, in 
particular, in a Java environment. We discuss strategies of coarse-grained 
multithreaded execution of synthesized programs: composing threads and 
imposing parallelism on subtasks. 



1 Introduction 

A structurally synthesized functional program does not have constraints on execution 
order other than imposed by data dependencies and by logical constraints explicitly 
expressed in pre- and post-conditions of functions. On the other hand, its
specification contains explicit and easily usable information about all data 
dependencies that must be taken into account when synchronizing concurrent 
execution of its parts. This can be used for parallelization of structurally synthesized 
programs. Still, the existing implementations of the structural synthesis of programs 
(SSP) produce code for sequential execution in one single thread [5], although the 
first works on concurrent execution of programs obtained by structural synthesis 
appeared long ago [3]. Also the parallel computing models developed and 
investigated in [2] were quite close to the specifications for SSP, and could have been 
used for introducing concurrency into structurally synthesized programs. The ideas 
from [2] were to some extent used in the packages developed for parallel processing 
on the NUTS platform [6]. 

At present we have a new implementation of SSP on the basis of Java [1] that
supports both multithreading and an easy way to organize concurrent computations in
a network of workstations. Consequently, good technical possibilities exist for
using both fine-grained and coarse-grained concurrency in the implementation of
structurally synthesized algorithms. The question is how to parallelize computations
automatically, because this is needed for programs synthesized dynamically at run
time. First, we discuss our idea of synthesizing concurrent programs using dataflow
synchronization in general. Second, we consider a multithreaded implementation of
structurally synthesized algorithms in Java. Finally, we discuss strategies of
coarse-grained parallelization on the basis of information available in the specifications for
SSP, under the assumption that the user can give no help for parallelization at run
time.



2 Multithreaded Execution of a Synthesized Program

The SSP uses a simple constructive logic that allows us to specify dataflow among
functions. Let us consider an example in a more restricted logic than the one SSP
uses, which still illustrates our approach. Input for synthesis is given as a collection of
formulae that includes, among others, the following three formulae: $f : A \to B$, $g : U \to V$,
and $h : B \land V \to X$, where $A \to B$, $U \to V$, and $B \land V \to X$ are specifications of the functions f, g,
and h. The dataflow between the functions f, g, and h is described by their respective
specifications; i.e., the output of f (which is specified by B) and the output of g
(specified by V) are used as inputs for h. The propositional variables A, B, U, V, and X
are types that express the conceptual roles that their respective objects play in
computations.

Current implementations of the SSP extract an algorithm from a proof of existence
of an object; the algorithm is then translated into byte code [5] or source code of a particular
target programming language [1]. In either case, the resulting program is executed
sequentially in a single thread. The following figure depicts the synthesis process of a
sequential program for computing an object specified by X.




Fig. 1. Program synthesis 

A concurrent implementation of a structurally synthesized algorithm is constructed of 
a set of dynamically created guarded threads and shared input/output objects that are 
used for communication between guarded threads. Shared input/output objects are 
available to guarded threads through an environment. A guarded thread represents a 
function selected from the proof of the existence of the result. The shared objects 
serve as "communication channels" for passing the result of one function (thread) to 
other functions (other threads). As in the dataflow computation model, a guarded 
thread can perform its computation if its required input objects in the computation 
exist and they are bound, so that the thread can get an input object and operate on it 
when needed. If a required input object is not available yet (it has not been bound 
yet), the thread is suspended while waiting for the input object to be bound. When the 
thread completes, it binds its computation result to its output object. It may happen 
that the output object still contains the result bound by a previous instance of the 
thread, i.e. the output object was not yet consumed by another thread. In dataflow,
such a situation is called a data collision (or a collision of the tokens that represent the data).
The simplest way to avoid data collisions is to suspend the thread that tries to bind its 
computation result to its respective output object until the output object can be reused 
for binding the computation result. This approach requires maintaining a queue of 
threads suspended on the shared object that cannot be reused. We use a more efficient 
mechanism to avoid data collisions, known as dynamic dataflow. Every instance of 
one and the same guarded thread is associated with a different instance of the 
environment. Every environment creates its own input/output objects. A guarded 
thread knows only those objects that are needed for input and output. A shared object 
can be considered as a "write-once" (or "bound-once") object. 

A guarded thread does the following: 1) it waits until all objects needed as inputs 
become available, 2) if all objects are available then it executes, and 3) it binds its 
computation result to its respective output object. Let us assume that we have a 
programming platform with some means to realize dataflow synchronization - 
waiting for input values. A simple idea is to use dynamic dataflow synchronization on 
guarded threads that represent separately every function of the proof of the existence 
of the result. In this way one constructs a concurrent implementation of a structurally 
synthesized program, where each computational step (execution of one function) is 
encapsulated in a thread, which is illustrated in the following figure. 




Execution Environment

Thread 1
get a from A
exec(f(a))
bind result f(a) to B

Thread 2
get u from U
exec(g(u))
bind result g(u) to V

Thread 3
get b from B, get v from V
exec(h(b,v))
bind result h(b,v) to X



Fig. 2. Guarded threads 

Due to dataflow synchronization (waiting until an object is bound), threads 1 and 2 will
run in parallel, and thread 3 must wait until both threads 1 and 2 have finished their
computation, i.e. have bound their outputs to the respective objects.
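
The synchronization pattern of the three threads can also be sketched with java.util.concurrent.CompletableFuture. This is only an illustration under stated assumptions: CompletableFuture is a later standard-library facility and is not the mechanism described in this paper, and the bodies of f, g, and h below are hypothetical string-building stand-ins.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-ins for the functions f : A -> B, g : U -> V, h : B & V -> X.
class DataflowSketch {
    static String f(String a) { return "f(" + a + ")"; }
    static String g(String u) { return "g(" + u + ")"; }
    static String h(String b, String v) { return "h(" + b + "," + v + ")"; }

    // f and g are submitted independently and may run in parallel;
    // h is applied only after both of their results are available.
    static String compute(String a, String u) {
        CompletableFuture<String> b = CompletableFuture.supplyAsync(() -> f(a));
        CompletableFuture<String> v = CompletableFuture.supplyAsync(() -> g(u));
        return b.thenCombine(v, DataflowSketch::h).join();
    }

    public static void main(String[] args) {
        System.out.println(compute("a", "u"));
    }
}
```

As in the figure, no execution order is written down explicitly; the order follows from which results each step needs.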



3 Guarded Threads 

In this section we discuss the concurrent implementation of structurally synthesized 
programs in Java. Although Java is not designed for massively concurrent 
programming, we are still able to implement concurrent applications in Java. It 
supports threads and provides synchronization primitives that can be used to 
implement dataflow synchronization. We are well aware that other
programming languages do better in terms of concurrent programming, for instance the
Mozart [7] system, which implements Oz. The programming language
Oz possesses a built-in dataflow synchronization mechanism [8].

There are several ways to implement concurrent programs for structurally
synthesized algorithms. For simplicity, and to keep our examples
comprehensible, we shall use easily understandable source code and avoid using
sophisticated reflection tools. In our example, we assume that the functions f, g, and
h are methods of the class MyClass. Their internal structure is of no interest to us.
First, we explain how we implement dataflow synchronization in Java. Data 
dependencies among functions are expressed by propositional variables of the 
respective specifications of the functions under consideration. For example, function
h depends on functions f and g through their common propositional variables B
and V, where B and V occur in the premise of the specification of h and B (V resp.)
occurs in the conclusion of the specification of function f (g resp.). In case of Java,
propositional variables of specifications become parameters of methods of classes in 
extracted programs [1], where they are used as input (if they occur on the left hand 
side of the method specification) and output (if they occur on the right hand side of 
the method specification). To implement dataflow synchronization we simply
encapsulate an input/output object in a wrapper object, which is an instance of an
ObjectWrapper class. The ObjectWrapper class defines two methods, get
and bind. Guarded threads use the method get to obtain an input object, and they use
the method bind to bind their respective function result to its respective wrapper 
object. If a guarded thread invokes the method get on a wrapper object then this 
thread is blocked in the method call of get if the wrapper object has not yet bound an 
object, i.e. the method bind has not yet been invoked on this wrapper object. As soon 
as the method bind is invoked on this wrapper object, all guarded threads that are 
blocked in get will execute again. A guarded thread is simply a class that 
encapsulates the execution of a method. It invokes the method get on all wrapper 
objects that wrap the input objects of the encapsulated method (f), and invokes bind 
on that wrapper object that wraps the output object of the encapsulated method (f). 
The following code example is an implementation of the guarded thread class that 
encapsulates the method f (similar for the methods g and h). 



class GuardedThread_f extends Thread {

    ConcurrentImpl env;

    GuardedThread_f(ConcurrentImpl env, int id) {
        super("Guarded Thread ID: [" + id + "]");
        this.env = env;
        this.start();
    }

    public void run() {
        // get the value of A (blocks until A is bound)
        Object o = this.env.A.get();

        // execute the method f and bind the result to B
        this.env.B.bind(this.env.f(o));
    }
}




The method main of the class ConcurrentImpl (concurrent implementation)
creates the execution environment (an object of class ConcurrentImpl and the wrapper
objects) for the concurrent execution of the needed guarded threads. After all guarded
threads are created, the input objects of our synthesized concurrent program, which
are proper objects for A and U, are bound to their respective wrapper objects. The
program then waits until the result (which is a proper object for X) is computed.



class ConcurrentImpl extends MyClass {

    // one wrapper per propositional variable used by main and the guarded threads
    ObjectWrapper A = new ObjectWrapper();
    ObjectWrapper B = new ObjectWrapper();
    ObjectWrapper U = new ObjectWrapper();
    ObjectWrapper V = new ObjectWrapper();
    ObjectWrapper X = new ObjectWrapper();

    // realization of A & U -> X
    public static void main(String[] args) {
        ConcurrentImpl env = new ConcurrentImpl();
        new GuardedThread_f(env, 1);
        new GuardedThread_g(env, 2);
        new GuardedThread_h(env, 3);

        // bind initial values
        env.A.bind(args[0]);
        env.U.bind(args[1]);

        // wait until goal X is computed
        System.out.println(env.X.get());
    }
}




Note that the implementations of our classes do not implement the control part of 
the synthesized algorithm explicitly. The computation of the result is guided by 
dataflow synchronization. 
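
The wrapper class itself is not listed in the text. The following is a minimal sketch of how such a write-once wrapper could be written with Java monitors; only the method names get and bind come from the description above, while the fields, the handling of interrupts, and the exception thrown on a second bind are assumptions made for illustration.

```java
// A sketch of a write-once ("bound-once") wrapper object.
class ObjectWrapper {
    private Object value;
    private boolean bound = false;

    // Blocks the calling thread until some thread has called bind.
    synchronized Object get() {
        boolean interrupted = false;
        while (!bound) {
            try {
                wait();
            } catch (InterruptedException e) {
                interrupted = true;   // remember the interrupt and keep waiting
            }
        }
        if (interrupted) {
            Thread.currentThread().interrupt();   // restore the interrupt status
        }
        return value;
    }

    // Binds the value once and wakes up all threads blocked in get.
    synchronized void bind(Object o) {
        if (bound) {
            throw new IllegalStateException("wrapper is already bound");
        }
        value = o;
        bound = true;
        notifyAll();
    }
}
```

Rejecting a second bind reflects the write-once view of shared objects; an implementation could equally well choose to ignore or queue it.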

The SSP uses a logic that is much more expressive than the one we used here in our
example. For instance, the formula $(U \to V) \land X \to Y$ specifies that we can
compute a proper object for Y if we have a proper object for X and if we can find a
realization of the subtask $(U \to V)$. This subtask receives a proper object for U from
the function that receives this subtask as input, and computes a proper object for V. In
Java one can implement subtasks as classes [1]. In the case of concurrent
implementations of structurally synthesized algorithms it is possible to implement
subtasks as threads as well, which then themselves create guarded threads (similar to
our main method of class ConcurrentImpl). The fact that subtasks are threads
gives us the possibility to execute one and the same synthesized branch
(subtask) concurrently by creating more than one object of the respective subtask class. Imposing
parallelism on subtasks will be discussed in section 4.2.
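
To make the role of a subtask concrete, here is a small sequential sketch in which the subtask is passed as a java.util.function.Function. The use of Function instead of a subtask class, the choice of Integer for all four types, and the summing body are assumptions for illustration only.

```java
import java.util.function.Function;

class SubtaskSketch {
    // A function with a specification of the shape (U -> V) & X -> Y.
    // It receives a realization of the subtask and an object for X,
    // supplies objects for U to the subtask, and uses the returned
    // objects for V to build the object for Y.
    static Integer applyWithSubtask(Function<Integer, Integer> subtask, Integer x) {
        int y = 0;
        for (int u = 1; u <= x; u++) {
            y += subtask.apply(u);   // the subtask computes v from u
        }
        return y;
    }

    public static void main(String[] args) {
        // The subtask u -> u * u is evaluated for u = 1, 2, 3.
        System.out.println(applyWithSubtask(u -> u * u, 3));
    }
}
```

Each call of the subtask here is independent of the others, which is the situation in which executing several instances of a subtask concurrently becomes possible.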

The general pattern of a class that realizes a subtask is very similar to the
class ConcurrentImpl. In addition, a subtask class is also a thread class that
implements a subtask method. This subtask method, which is invoked by the outer 
environment, starts the subtask thread. 

The idea to use dataflow synchronization on guarded threads enables us to execute
a synthesized algorithm in a maximally parallel way. On the other hand, the fine
granularity of threads may incur heavy execution overhead; it can be practical
in environments like Mozart, but not on the Java platforms of today. The granularity
of parallelization of structurally synthesized programs will be discussed in the
following sections.



4 Coarse-Grained Parallelization

An obvious way to decrease the execution overhead for each function is to put several
functions into one and the same thread, i.e. to make the threads coarser-grained. Some
experience of coarse-grained parallel usage of synthesized programs exists already. In
[5], programs for large simulation problems have been synthesized and run on a
network of workstations. However, only pre-programmed parallelism of subtasks has
been used in this case. We shall discuss the following two cases with the aim of using
coarse-grained parallelization completely automatically: 1) composing threads,
and 2) imposing parallelism on subtasks.

Here we are going to use a representation of a structurally synthesized program in 
the form of a higher-order dataflow scheme (HODS) [4]. Nodes of a HODS are 
functional nodes and control nodes. The control nodes have subtasks. They not only
exercise control over the execution order of synthesized branches, but perform
computations as well. The figure shows such a scheme with four functional nodes b,
c, d and e, and a control node a with one subtask Q. As usual for the SSP, we denote
data dependencies as going in the direction of dataflow, showing explicitly also the 
data items as nodes of the scheme. We use small letters for representing data items. 




Fig. 3. Higher-order dataflow scheme 



The scheme shows an algorithm for computing y from x by performing computations
under the control of the node a. This node uses a subtask Q for computing z from u
and v, possibly repetitively. When computing for the subtask, two branches, b and
c;d, can be performed concurrently. Parallelization is also possible for the subtasks:
depending on the data dependencies in the node a, it may be possible to arrange the
computations for each set of values of u and v (repeating computations for the
subtask Q) concurrently. How much should be done concurrently, and what should be
done at one site, depends on the properties of computations for each functional node.
Any attempt to find an optimal solution leads to NP-complete problems. Considering
the large size of the schemes we are handling in SSP (up to thousands of nodes), looking
for optimal solutions is impractical. Therefore we consider the following heuristic
techniques.



4.1 Composing Threads

We build threads in order to execute them concurrently and look for maximal
sequences of steps that can be performed in one thread sequentially. A functional
node in a HODS may have several inputs, like the node a in figure 4. Therefore it may
be included in different threads (see [3]). We have decided to compose threads in such
a way that a thread can run without synchronization with other threads after its
execution has started (i.e. synchronization is needed only for starting a thread).
Therefore, a node with input from more than one thread (like the node a in figure 4)
will always be the first in a thread. This is motivated by the fact that threads will be
built only in the case when computations in nodes are so extensive that concurrency
gives some performance advantage.
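
One possible greedy reading of this rule is sketched below. It is not the composition procedure of the paper: the representation of the scheme as predecessor lists, the assumption that nodes are numbered in topological order, and the tie-breaking (the first successor continues its producer's thread) are all choices made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ThreadComposer {
    // preds.get(i) lists the nodes whose outputs node i needs. Nodes are
    // assumed to be numbered so that every predecessor has a smaller index.
    // Returns the composed threads as lists of node numbers in execution order.
    static List<List<Integer>> compose(List<List<Integer>> preds) {
        int n = preds.size();
        int[] threadOf = new int[n];
        boolean[] extended = new boolean[n];   // node already has a successor in its own thread
        List<List<Integer>> threads = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            List<Integer> p = preds.get(i);
            if (p.size() == 1 && !extended[p.get(0)]) {
                // single input whose producer is still the last step of its thread:
                // run node i right after it, with no synchronization
                int pred = p.get(0);
                extended[pred] = true;
                threadOf[i] = threadOf[pred];
                threads.get(threadOf[i]).add(i);
            } else {
                // no input, several inputs, or the producer's thread already continues
                // elsewhere: node i becomes the first step of a new thread
                threadOf[i] = threads.size();
                List<Integer> t = new ArrayList<>();
                t.add(i);
                threads.add(t);
            }
        }
        return threads;
    }
}
```

For a scheme with arcs 0 to 1, 0 to 2, 1 to 3 and 2 to 3, the sketch yields the threads [0, 1], [2] and [3]: node 3 has inputs from two threads and therefore starts its own.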




4.2 Imposing Parallelism on Subtasks 

Control nodes implemented in Java can be easily programmed in a multithreaded 
way. Knowing the usage of a control node, it may be possible to decide in advance 
whether its subtasks should be executed concurrently. It is possible as well to include 
several control nodes into a specification for synthesis that differ only by their
implementation, and to use extra variables in a specification to show which 
implementation (sequential or concurrent) is needed in a particular case. 

A rich set of control nodes for concurrent execution was developed for the
distributed computing platform NUTS that has been described in [6]. Here we give an
example of a control node for parallel processing of collections that implement the
Java Enumeration interface, see figure below. A collection a is processed element
by element, each time processing an element x of the collection and computing a new
element y of the resulting array c. The subtask P specifies what has to be done with
an element x from the collection a in order to obtain an element y of the array c. It is assumed
that the subtask P is computationally heavy, and the computations for the elements of the
collection are performed in parallel.
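
A control node of this kind might look as follows. This is a sketch under assumptions, not the control node of the paper: it uses a fixed-size thread pool from java.util.concurrent, models the subtask P as a Function, and introduces the names process and workers for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelControlNode {
    // Applies the subtask p to every element x of the collection a and
    // returns the results as the array c, in the order of enumeration.
    static Object[] process(Enumeration<?> a, Function<Object, Object> p, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Object>> pending = new ArrayList<>();
            while (a.hasMoreElements()) {
                Object x = a.nextElement();
                Callable<Object> task = () -> p.apply(x);   // one subtask instance per element
                pending.add(pool.submit(task));
            }
            Object[] c = new Object[pending.size()];
            for (int i = 0; i < c.length; i++) {
                c[i] = pending.get(i).get();                // wait for element i
            }
            return c;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Enumeration<Integer> a = Collections.enumeration(Arrays.asList(1, 2, 3));
        Object[] c = process(a, x -> ((Integer) x) * 2, 2);
        System.out.println(Arrays.toString(c));
    }
}
```

Collecting the futures in enumeration order keeps the positions in c aligned with the elements of a, regardless of which subtask instance finishes first.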







Parallel execution of P 



5 Concluding Remarks 

In this paper we have shown how to use the information already existing in a 
specification for structural synthesis of a program for concurrent implementation of 
the synthesized program. We see several possibilities of usage: multithreaded 
execution of functions, parallel execution of composed threads, and distributed 
implementation of a coarse grained concurrent program. 

The advantages of the proposed method lie in achieving concurrency without
requesting additional information from the user and in a comparatively small
implementation effort. Moreover, a composed program does not implement the control part of the
synthesized algorithm explicitly; the computation of the program's result is guided by
dataflow synchronization. The computational overhead of implementing fine-grained
parallelism may be high if the pre-programmed functions are small. In this case,
composing guarded threads should be taken into consideration.



Abstract. In this paper, by means of a model of associative parallel
systems with vertical data processing (the STAR-machine), we propose
a natural, straightforward implementation of the Bellman-Ford shortest
path algorithm. We represent this algorithm as the corresponding STAR
procedure, justify its correctness and evaluate its time complexity.



1 Introduction 

Problems of finding the shortest paths are among the fundamental tasks of
combinatorial optimization. An important version of the shortest path problem is
the single-source problem. Given a directed weighted graph with n vertices and m arcs
and a distinguished vertex s, the single-source shortest path problem is to find
for each vertex v the length of the shortest path from s to v. When all arc weights
are non-negative, the most efficient solution is given by Dijkstra's sequential shortest
path algorithm [3]. In [4], Ford generalizes Dijkstra's algorithm to graphs having
negative arc weights but no cycles of negative weight.

The most efficient solution of the single-source shortest path problem for
general network topologies is given by the Bellman-Ford algorithm [1,4]. On conventional
sequential computers, it takes $O(n^3)$ time for complete connected graphs
and $O(nm)$ time for sparse graphs [2].

In this paper, we study a matrix representation of the Bellman-Ford algorithm
on a model of associative (content addressable) parallel systems of the
SIMD type with vertical processing (the STAR-machine). To this end, we use a
group of new basic procedures for updating graphs with negative arc weights
[8]. Here, we propose a natural, straightforward implementation of the Bellman-Ford
algorithm on the STAR-machine and justify its correctness. Assuming that
each elementary operation of the model under consideration (its microstep) takes
one unit of time, we obtain that the corresponding STAR procedure takes $O(hn^2)$
time, where h is the number of bits required for coding the maximal weight of
the shortest paths from the source vertex.

* This work was supported in part by the Russian Foundation for Basic Research
under Grant N 99-01-00548



2 The STAR-machine

The model is based on a Staran-like associative parallel processor [5]. We define
it as an abstract STAR-machine of the SIMD type with bit-serial (vertical) processing
and simple single-bit processing elements (PEs) [6]. The model consists
of the following components:

- a sequential control unit (CU), where programs and scalar constants are 
stored; 

- an associative processing unit consisting of p single-bit PEs; 

- a matrix memory for the associative processing unit. 

The CU broadcasts an instruction to all PEs in unit time. All active PEs 
execute it simultaneously while inactive PEs do not perform it. Activation of a 
PE depends on the data. 

Input binary data are loaded in the matrix memory in the form of two-dimensional
tables in which each datum occupies an individual row and is
updated by a dedicated PE. The rows are numbered from top to bottom and
the columns from left to right. Both a row and a column can be easily accessed.

The associative processing unit is represented as h vertical registers, each 
consisting of p bits. The vertical registers can be regarded as a one-column 
array. The bit columns of the tabular data are stored in the registers which 
perform the necessary Boolean operations. 

The STAR-machine run is described by means of the language STAR which is
an extension of Pascal. Let us briefly consider the STAR constructions needed for
the paper. To simulate data processing in the matrix memory, we use data types
word, slice, and table. Constants for the types slice and word are represented
as a sequence of symbols of {0, 1} enclosed within single quotation marks. The
types slice and word are used for the bit column access and the bit row access,
respectively, and the type table is used for defining the tabular data. Assume
that any variable of the type slice consists of p components which belong to
{0, 1}. For simplicity, let us call "slice" any variable of the type slice.

Now, we present some elementary operations and predicates for slices. 

Let X, Y be variables of the type slice and i be a variable of the type integer.
We use the following operations:

SET(Y) sets all components of Y to '1'; CLR(Y) sets all components of Y to
'0'; Y(i) selects the i-th component of Y; FND(Y) returns the ordinal number
i of the first (the uppermost) '1' of Y, $i \geq 0$; STEP(Y) returns the same result
as FND(Y) and then resets the first '1' found to '0'.

In the usual way we introduce the predicates ZERO(Y) and SOME(Y) and
the bitwise Boolean operations X and Y, X or Y, not Y, X xor Y.

Let T be a variable of the type table. We use the following two operations:

ROW(i,T) returns the i-th row of the matrix T; COL(i,T) returns the i-th
column of T.
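
For readers more familiar with conventional languages, the slice operations can be mimicked sequentially with java.util.BitSet. The sketch below is only an analogy under stated assumptions: the method names are invented here, the result 0 for a slice without any '1' and the meaning of the two predicates (no '1' at all, at least one '1') are assumed, and a sequential class says nothing about the one-step parallel execution on the STAR-machine.

```java
import java.util.BitSet;

// A slice of p one-bit components, numbered 1..p from top to bottom.
class Slice {
    private final int p;
    private final BitSet bits;

    Slice(int p) {
        this.p = p;
        this.bits = new BitSet(p);
    }

    void setAll() { bits.set(0, p); }                    // all components become '1'
    void clearAll() { bits.clear(); }                    // all components become '0'
    boolean get(int i) { return bits.get(i - 1); }       // the i-th component
    void put(int i, boolean b) { bits.set(i - 1, b); }   // assignment to the i-th component

    // Ordinal number of the uppermost '1', or 0 if the slice contains none.
    int firstOne() { return bits.nextSetBit(0) + 1; }

    // Same result as firstOne, and the '1' that was found is reset to '0'.
    int step() {
        int i = firstOne();
        if (i > 0) {
            bits.clear(i - 1);
        }
        return i;
    }

    boolean zero() { return bits.isEmpty(); }            // true if there is no '1'
    boolean some() { return !bits.isEmpty(); }           // true if there is at least one '1'
}
```

A typical use is a loop of the form `while (y.some()) { int i = y.step(); ... }`, which visits the positions marked by ones from top to bottom.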

Remark 1. Note that the STAR statements are defined in the same manner
as for Pascal. We will use them later for presenting our procedures.



3 Preliminaries

Let $G = (V, E, w)$ be a directed weighted graph with the set of vertices
$V = \{1, 2, \ldots, n\}$, the set of directed edges (arcs) $E \subseteq V \times V$ and the function $w$ that
assigns a weight to every edge. We assume that $|V| = n$ and $|E| = m$.

A weight matrix for $G$ is an $n \times n$ matrix which contains arc weights as
elements. If $(v_i, v_j) \notin E$, then $w(v_i, v_j) = \infty$.

An adjacency matrix $A$ for $G$ is an $n \times n$ Boolean matrix in which $a_{ij} = 1$ if
$(v_i, v_j) \in E$ and $a_{ij} = 0$ otherwise.

A path from $u$ to $v$ in $G$ is a sequence of vertices $u = v_1, v_2, \ldots, v_k = v$, where
$(v_i, v_{i+1}) \in E$ for $i = 1, 2, \ldots, k-1$ and $k > 1$.

The shortest path between two vertices in $G$ is a path with the minimal sum
of weights of its arcs.

The distance from $v_i$ to $v_j$ is the weight of the shortest path between these
vertices.

Now, recall three basic procedures from [6] implemented on the STAR-machine
which will be used later on.

The procedure TMERGE(T, X, F) writes the rows of the given matrix T,
selected by ones in the slice X, into the matrix F. Other rows of the matrix F
are not changed.

The procedure TCOPY(T, h, F) writes the given matrix T, consisting of h
columns, into the result matrix F.

The procedure TCOPY1(T, j, h, F) writes h columns from the given matrix
T, beginning with its $(1 + (j-1)h)$-th column, into the result matrix F, where
$j \geq 1$.

The following three basic procedures from [8] use a given global slice X to
select by ones positions of the rows which will be processed. These procedures
are applied to an array which includes negative integers. Such an array is
represented as a matrix which saves only the magnitudes of the integers written
in binary code and a slice which saves only the signs. We assume that every
negative integer is indicated by one in this slice.

The procedure MIN*(T, X, Y, Z) uses the slice Y to save the signs of the
matrix T. It defines positions of those rows of the matrix T where minimal
elements are located. This procedure returns the slice Z, where Z(i) = '1' if and
only if either X(i) = '1', Y(i) = '0' and ROW(i,T) is the minimal element or
X(i) = '1', Y(i) = '1' and ROW(i,T) is the maximal element.

The procedure HIT*(T, R, X, Y, Z, Z1) uses the slices Y and Z to save the
signs of the given matrices T and R, respectively. It defines positions of the
corresponding coincident rows of the matrices T and R considering the signs.
This procedure returns the slice Z1, where Z1(i) = '1' if and only if X(i) = '1',
ROW(i,T) = ROW(i,R) and Y(i) = Z(i).

The procedure ADDV*(T, R, X, Y, Z, F, Z1) uses the slices Y and Z to save
the signs of the matrices T and R, respectively. It performs the algebraic addition
of the rows of the matrices T and R taking into account the signs. The procedure
writes the magnitude of the result in the matrix F and the signs in the slice Z1.




288 



A.S. Nepomniaschaya 



4 Representing the Bellman-Ford Algorithm
on the STAR-Machine

We first explain the main idea of the Bellman-Ford algorithm. 

This algorithm sets temporary labels for the vertices so that on terminating
the $k$-th iteration ($k \geq 1$) every label is equal to the length of the shortest path
from the source vertex s to the corresponding vertex, where this path includes no
more than $k + 1$ arcs. To perform this, the algorithm saves a set of vertices U
whose labels are changed at the current iteration.

To present the Bellman-Ford algorithm, we will use the following notations
from [2].

For every vertex $v_i$, let us assume that $\Gamma(v_i) = \{v_j : v_i \to v_j \in E\}$ and
$\Gamma^{-1}(v_i) = \{v_k : v_k \to v_i \in E\}$. If $U = \{v_1, \ldots, v_r\}$, then we have
$\Gamma(U) = \bigcup_{i=1}^{r} \Gamma(v_i)$. Let $l^k(v_i)$ be the label for the vertex $v_i$ after terminating the $k$-th
iteration.

The Bellman-Ford algorithm runs as follows.

Initially $U = \Gamma(s)$, $l^1(s) = 0$, $\forall v_i \in \Gamma(s)$ $l^1(v_i) = w(s, v_i)$ and $l^1(v_i) = \infty$
otherwise.

For every vertex $v_i \in \Gamma(U)$, its label is updated at the $k$-th iteration ($k \geq 1$)
as shown below:

$$l^{k+1}(v_i) = \min\Bigl[\, l^k(v_i),\ \min_{v_j \in T_i} \{ l^k(v_j) + w(v_j, v_i) \} \Bigr], \qquad (1)$$

where $T_i = \Gamma^{-1}(v_i) \cap U$. In other words, the set $T_i$ includes those vertices for
which the current shortest path from s consists of $k$ arcs and there is an arc
entering the vertex $v_i$. If $v_i \notin \Gamma(U)$, then $l^{k+1}(v_i) = l^k(v_i)$.

The termination of this algorithm is defined as follows:

(i) If $k \leq n-1$ and $l^{k+1}(v_i) = l^k(v_i)$ for all $v_i$, then the algorithm terminates.
The labels for the vertices are equal to lengths of the shortest paths.

(ii) If $k = n-1$ and $l^{k+1}(v_i) \neq l^k(v_i)$ for some $v_i$, then the algorithm
terminates with the message: 'There is a cycle of the negative weight.'

If $k < n-1$ and $l^{k+1}(v_i) \neq l^k(v_i)$ for some $v_i$, then $U = \{v_i : l^{k+1}(v_i) \neq l^k(v_i)\}$
and the $(k+1)$-th iteration will be performed.
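
The iteration just described, with the set U of vertices whose labels changed and the two termination cases, can be traced in a conventional sequential form. The sketch below is not the STAR procedure: the dense weight matrix with a large constant INF for absent arcs, the absence of self-loops, 0-based vertex numbers, and the return value null for a negative cycle are assumptions of this illustration.

```java
import java.util.Arrays;

class BellmanFordSketch {
    static final long INF = Long.MAX_VALUE / 4;   // stands for an absent arc or an infinite label

    // w[j][i] is the weight of the arc from j to i, or INF if there is none.
    // Returns the labels of all vertices, or null if a cycle of negative
    // weight is detected.
    static long[] shortestPaths(long[][] w, int s) {
        int n = w.length;
        long[] label = new long[n];
        Arrays.fill(label, INF);
        label[s] = 0;
        boolean[] inU = new boolean[n];            // the set U
        for (int i = 0; i < n; i++) {
            if (w[s][i] < INF) {                   // initial labels of the successors of s
                label[i] = w[s][i];
                inU[i] = true;
            }
        }
        for (int k = 1; k <= n - 1; k++) {
            long[] next = label.clone();
            for (int j = 0; j < n; j++) {
                if (!inU[j]) continue;             // only vertices of U can improve a label
                for (int i = 0; i < n; i++) {
                    if (w[j][i] < INF && label[j] + w[j][i] < next[i]) {
                        next[i] = label[j] + w[j][i];
                    }
                }
            }
            boolean[] changed = new boolean[n];
            boolean any = false;
            for (int i = 0; i < n; i++) {
                if (next[i] != label[i]) {
                    changed[i] = true;
                    any = true;
                }
            }
            if (!any) return label;                // case (i): no label changed
            if (k == n - 1) return null;           // case (ii): a label still changes
            label = next;
            inU = changed;
        }
        return label;
    }
}
```

For the graph with arcs 0 to 1 (weight 4), 0 to 2 (weight 5), 2 to 1 (weight -3) and 1 to 3 (weight 2), the labels returned for source 0 are 0, 2, 5 and 4.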

On the STAR-machine, the Bellman-Ford algorithm is represented as the procedure
BelFord. The graph G is given by means of a weight matrix T which stores
only the weight magnitudes and a matrix Q which stores only the weight signs.
Let us agree that each negative weight is indicated by one in the matrix Q. To
represent $w_{ij} = \infty$ in the matrix T, we choose an integer $r = \sum_{i=1}^{n} \gamma_i$, where
$\gamma_i$ is the magnitude of the maximal weight of arcs incident from the vertex $v_i$.
Let inf be the binary representation of $r$ and let $h$ be the number of bits in this
representation. Then the matrix T consists of $hn$ bit columns and for every $i$
the weights of arcs entering the vertex $v_i$ are written in the $i$-th field having
$h$ bit columns. It should be noted that in view of formula (1) the length of the
shortest path from s to every vertex of G is less than $r$.

The procedure BelFord uses the following input parameters:
the weight matrix T and the matrix of signs Q; the source vertex s; the
number of bits h; the binary word inf for representing $\infty$.

The procedure returns the distance matrix D which stores only magnitudes 
of the distances and the slice Z which stores only the corresponding signs. 

Note that the distance from s to $v_i$ is written in the i-th row of D and
Z(i) = '1' if and only if this distance is negative.

The procedure BelFord uses the following main variables: 

an adjacency matrix A; a matrix A1 which is obtained by transposing
the matrix A; a matrix M for computing the new labels for the vertices which
are accessible from s at the current iteration; a matrix M1 for saving the new
labels for the same vertices as the matrix M; a slice Y1 for storing positions of
the vertices whose new labels are negative; a slice U for saving positions of the
vertices whose labels are changed at the current iteration; a slice Y for saving
positions of the vertices which are accessible from the vertex s at the current
iteration.

The run of the procedure BelFord includes the following stages. 

At the first stage, the matrices A and A1 are defined. Then the matrix D 
and the slice Z are initialised. 

At the second stage, positions of the vertices $v_i$ which are adjacent to s 
are stored in the slice U. The weights of the arcs $(s, v_i)$ are the labels 
for the vertices $v_i$. 

At the third stage, for every vertex $v_i$ selected by a one in the slice U, 
positions of all vertices $v_j \in \Gamma(v_i)$ are saved in the slice Y. 

At the fourth stage, for every vertex $v_p$ selected by a one in the slice Y, 
the value $l^{k+1}(v_p)$ is defined as follows: 

- first, positions of vertices $v_j \in T_p = \Gamma^{-1}(v_p) \cap U$ are 
defined in parallel; 

- then, using the basic procedure ADDV*, the expression 
$l^{k}(v_j) + w(v_j, v_p)$ is computed for all vertices $v_j \in T_p$ in 
parallel; magnitudes of the results are saved in the corresponding rows of the 
matrix M and the signs in the slice Z2; 

- finally, by means of the basic procedure MIN*, the position of 
$l^{k+1}(v_p)$ in the matrix M is selected. The magnitude of this value is 
stored in the p-th row of the matrix M1 and the sign in the p-th bit of the 
slice Y1. 

At the fifth stage, using the basic procedure HIT*, positions of the 
corresponding coincident rows of the matrices M1 and D are selected in parallel. 

At the sixth stage, the termination of the procedure BelFord is verified in the 
same manner as described in the Bellman-Ford algorithm. 

At the seventh stage, by means of the basic procedure TMERGE, new values 
of the labels are written in the matrix D. Moreover, positions of vertices whose 
labels have been changed at the current iteration are stored in the slice U. After 
that, stage 3 is performed. 

Remark 2. Note that positions of arcs entering $v_i$ are selected by ones in 
the i-th column of the matrix A, while positions of arcs outgoing from $v_i$ 
are selected by ones in the i-th column of the matrix A1. 

Remark 3. Obviously, after terminating stage 4, new values $l^{k+1}(v_p)$ have 
been written in the matrix M1 for all $v_p$ selected by ones in the slice Y. 

5 Execution of the Procedure BelFord 

To present the procedure BelFord, we need the following auxiliary procedures. 

The procedure ADJ(T,h,n,inf,A) returns the adjacency matrix A for the 
given matrix T. It runs as follows. For each i, using TCOPY1, this procedure 
first selects the i-th field of T consisting of h bits. Then, using the basic 
procedure MATCH [7], it defines positions of the rows not coincident with the 
binary string inf and sets ones in the same positions of the i-th column of A. 

The procedure TRANS(A,n,A1) returns the matrix A1 being the transpose 
of the matrix A. It runs as follows. For each i it defines positions of ones in the 
i-th row of A and sets ones in the same positions of the i-th column of A1. 

The procedure INIT(T,Q,h,n,s,D,Z) returns the matrix D and the slice 
Z. It runs as follows. By means of the operation TRIM [6], it “cuts” the s-th 
row of the matrix T into n substrings, each consisting of h bits, and writes each 
i-th substring in the i-th row of D. Then, it defines positions of ones in the s-th 
row of the matrix Q and sets ones in the same positions of Z. 
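
The effect of the three auxiliary procedures can be sketched sequentially as follows. The sketch does not model TCOPY1, MATCH or TRIM themselves; it assumes that each row of T is a string of '0' and '1' characters, that Q is a matrix of 0/1 flags, and that vertices are numbered from 0. The lower-case function names are hypothetical.

```python
def adj(T, h, n, inf):
    """A[j][i] == 1 iff the i-th field of row j of T differs from inf,
    i.e. iff there is an arc (v_j, v_i)."""
    return [[0 if T[j][i * h:(i + 1) * h] == inf else 1 for i in range(n)]
            for j in range(n)]

def trans(A, n):
    """Ones of the i-th row of A are copied to the i-th column of A1."""
    return [[A[i][j] for i in range(n)] for j in range(n)]

def init(T, Q, h, n, s):
    """Cut the s-th row of T into n fields of h bits; the i-th field becomes
    the i-th row of D. Z copies the s-th row of Q."""
    D = [T[s][i * h:(i + 1) * h] for i in range(n)]
    Z = list(Q[s])
    return D, Z
```

In this representation the s-th column of A1 marks the vertices adjacent to s, which is what the procedure BelFord reads into the slices X and U.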

Now, we present the procedure BelFord. 

proc BelFord(T,Q: table; h,n,s: integer; inf: word; 
             var D: table; Z: slice); 
var A,A1,M,M1,R: table; 
    U,X,X1,Y,Y1,Z1,Z2: slice; 
    i,k,p: integer; w: word; 
begin ADJ(T,h,n,inf,A); 
  TRANS(A,n,A1); 
  INIT(T,Q,h,n,s,D,Z); 
  X:=COL(s,A1); U:=X; 
  /* Positions of vertices, being adjacent to the vertex s, 
     are selected by ones in the slices X and U. */ 
  for k:=1 to n-1 do 
  begin TCOPY(D,h,M1); 
    Y1:=Z; CLR(Y); 
    while SOME(X) do 
    begin p:=STEP(X); 
      X1:=COL(p,A1); 
      Y:=Y or X1 
    end; 
    /* Positions of vertices which are accessible from s at the k-th 
       iteration are selected by ones in the slice Y. */ 
    X:=Y; 
    /* The slice X is used to determine the new value for U. */ 
    while SOME(Y) do 
    /* At the k-th iteration, we will define the distance from s 
       to every vertex selected by one in the slice Y. */ 
    begin p:=STEP(Y); 
      X1:=COL(p,A); 
      X1:=X1 and U; 
      /* Positions of arcs, entering the vertex v_p, are selected 
         by ones in the slice X1. */ 
      TCOPY1(T,p,h,R); 
      Z1:=COL(p,Q); 
      /* The weights of arcs entering the vertex v_p are written 
         in the matrix R and their signs in the slice Z1. */ 
      ADDV*(R,D,X1,Z1,Z,M,Z2); 
      /* The result of adding the corresponding rows of R and D, 
         selected by ones in X1, is written in M and the signs in Z2. */ 
      w:=ROW(p,D); 
      ROW(p,M):=w; 
      /* The p-th row of the matrix D is written in the p-th row 
         of the matrix M. */ 
      Z2(p):=Z(p); X1(p):='1'; 
      /* Position of the p-th row is indicated by one in the slice X1, 
         and its sign is saved in the p-th position of the slice Z2. */ 
      MIN*(M,X1,Z2,Z1); 
      i:=FND(Z1); w:=ROW(i,M); 
      ROW(p,M1):=w; Y1(p):=Z2(i) 
      /* The value l^{k+1}(v_p) is saved in the p-th row of the matrix M1 
         and its sign in the p-th position of the slice Y1. */ 
    end; 
    HIT*(D,M1,X,Z,Y1,Z1); 
    Z2:=X and (not Z1); 
    if ZERO(Z2) then exit; 
    if k=n-1 then 
    begin message 'There is a cycle of negative weight'; 
      exit 
    end; 
    TMERGE(M1,Z2,D); Z:=Y1; 
    /* New values for the labels are stored in the matrix D 
       and their signs in the slice Z. */ 
    X:=Z2; U:=Z2 
  end; 
end. 




Theorem. Let a directed weighted graph G be given as the matrix T which stores 
only the weight magnitudes and the matrix Q which stores only the weight signs. 
Let s be the source vertex, and let there be no directed cycle from s having 
negative weight. Let every arc weight use h bits and let inf be the binary 
representation of infinity. Then the procedure BelFord(T,Q,h,n,s,inf,D,Z) 
returns the distance matrix D, whose i-th row contains the magnitude of the 
distance from s to $v_i$, and the slice Z which stores the signs of the 
corresponding distances. It takes $O(hn^2)$ time on the STAR-machine having no 
less than n PEs. 

The theorem is proved by induction on the number of iterations. We omit the 
proof because of lack of space. 

Let us evaluate the time complexity of the procedure BelFord. We first observe 
that the auxiliary procedures take $O(n)$ time each. Obviously, the procedure 
BelFord performs $n-1$ iterations. Since it updates no more than n vertices in 
each iteration and the basic procedures run in $O(h)$ time each [6-7], we obtain 
that the procedure BelFord runs in $O(hn^2)$ time on the STAR-machine with n 
PEs, assuming that each elementary operation takes one unit of time. 

6 Conclusions 

We have proposed a matrix implementation of the classical Bellman-Ford 
shortest path algorithm on the STAR-machine, a model of associative parallel 
systems of the SIMD type with vertical processing. We have shown that the 
procedure BelFord takes $O(hn^2)$ time on the STAR-machine having no less 
than n PEs, assuming that each elementary operation takes one unit of time. It 
should be noted that the procedure BelFord performs $O(n^2)$ operations of 
addition and $O(n)$ operations of comparison for complete connected graphs, 
while the Bellman-Ford algorithm executes $O(n^3)$ such operations on 
conventional sequential computers [2]. 

We are planning to detect all cycles of negative weight and to restore 
the shortest paths from s to every vertex v, along with finding the distances, 
by means of a simple modification of the procedure BelFord. 

Abstract. This paper describes a mechanism for “fusing” concurrent 
invocations of exclusive methods. The target of our work is object-oriented 
languages with concurrent extensions. In the languages, concurrent invo- 
cations of exclusive methods are serialized; only one invocation executes 
immediately and the others wait for their turn. The mechanism fuses 
multiple waiting invocations to a cheaper operation such as a single in- 
vocation. The programmers describe fusion rules, which specify method 
invocations that can be fused and an operation that substitutes for the 
invocations. The mechanism works effectively in the execution in syn- 
chronization bottlenecks, which are objects on which exclusive methods 
wait a long time for their turn. We have implemented a language that 
has the mechanism and tested the usefulness of the mechanism through 
experiments on a symmetric multiprocessor, the Sun Enterprise 10000. 
We have confirmed that the mechanism made programs with synchronization 
bottlenecks faster. 



1 Introduction 

Most concurrent object-oriented languages and concurrent extensions to 
object-oriented languages have a mechanism for serializing concurrent method 
invocations made on the same object simultaneously. An example of the 
mechanism is synchronized methods in Java: only one synchronized method can 
execute on the object at one time. In this paper we use the term exclusive 
method to denote a method whose concurrent invocations on the same object are 
serialized. 

This paper describes a scheme for efficient execution of dynamically 
serialized invocations of exclusive methods. This scheme “fuses” multiple 
invocations of exclusive methods into a cheaper operation such as a single 
invocation of an exclusive method. The scheme serves to reduce the number of 
executions of exclusive methods. For example, when a method invocation that 
adds one and a method invocation that adds two are waiting on a counter object, 
we replace these invocations with a method invocation that adds three to the 
counter. Another example occurs in GUI programs in which the repaint method is 
invoked on a window object. The scheme fuses multiple invocations of the 
repaint method that are invoked on one object at almost the same time into a 
single invocation of the method. Below we call this scheme method fusion. 
Method fusion works well particularly in the execution of synchronization 
bottlenecks, which are objects on which exclusive method invocations wait a 
long time for their turn. 

* An extended version of this paper is available via 
http://www.osss.is.tsukuba.ac.jp/~yosh/publications/ 

The target of this work is concurrent object-oriented languages and concur- 
rent extensions to object-oriented languages that are implemented on shared- 
memory multiprocessors. 

Various techniques have been proposed so far for efficient execution of 
exclusive methods. They can be classified into two main groups. Those in one 
group concurrently execute a combination of exclusive methods that update 
distinct sets of variables [4,20,22]. Those in the other create replicas of 
synchronization bottlenecks [3,18]. Both groups have a drawback. The former 
cannot optimize a combination of exclusive methods that may update the same 
variable. For example, they cannot optimize multiple invocations of the add 
method described above. The latter do not allow the programmers to concisely 
describe dynamic changes in executed methods. Method fusion addresses both 
problems. 

The contributions of this work are shown below. 

- We propose a novel optimization that makes the execution of serialized 
  invocations of exclusive methods faster. We design a language that has an 
  API to support the optimization. The language is called Amdahl. 

- We develop an implementation scheme for Amdahl. 

- We incorporate method fusion into the Amdahl compiler and confirm the 
  usefulness of method fusion through experiments on a 64-processor symmetric 
  multiprocessor. 

The rest of this paper is organized as follows. Section 2 gives the overview 
of method fusion and Sect. 3 describes our language Amdahl. Section 4 shows 
sample programs written in Amdahl. In Sect. 5 we discuss the design of Amdahl 
and in Sect. 6 we explain an implementation scheme for Amdahl. Section 7 gives 
our experimental results and Sect. 8 describes related work. Section 9 concludes 
this paper. 

2 Overview of Method Fusion 

The following Java program provides a good starting point for understanding 
method fusion. 

class Counter { 
    private int value; 

    public Counter(int v) { value = v; } 

    public synchronized void inc(int n) { value += n; } 

    public synchronized void dec(int n) { value -= n; } 

    public synchronized int get() { return value; } 
} 

A Counter object keeps an integer counter value. The method inc increments 
the counter value by its argument, and the method dec decrements the counter 
value by its argument. The method get returns the counter value. 

The point to observe is that the execution of inc(x) and inc(y) has the same 
“effect” on the counter object as the execution of inc(x + y) with regard to 
the final value of the counter. Based on this observation, we attempt to “fuse” 
dynamically serialized invocations of inc(x) and inc(y) into an invocation of 
inc(x + y). 

Amdahl programmers can describe fusion rules, which specify a pair of 
method invocations that can be fused and an operation that substitutes for 
the invocations¹. For example, a fusion rule can be added to the definition of 
the Counter class as follows: 

class Counter { 
    fusion void inc(int x) & void inc(int y) { inc(x + y); } 
} 

The rule tells the compiler that 

    execution of inc(x) and inc(y) can be replaced with execution of 
    inc(x + y). 

According to the rule, the compiler and runtime system may fuse two 
invocations of the method inc into one invocation of the method inc. The 
invocation inc(x + y), invoked as a result of the fusion, may further be fused 
with another invocation of the method inc. 
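
The effect of this rule can be illustrated by a small sequential simulation. It is a sketch only, written in Python rather than Amdahl; the helper names fuse_inc and run_pending and the executions field are assumptions introduced for the illustration, and the waiting invocations are modelled as a plain list of arguments.

```python
class Counter:
    def __init__(self, v):
        self.value = v
        self.executions = 0          # how many times inc actually ran

    def inc(self, n):
        self.value += n
        self.executions += 1

def fuse_inc(x, y):
    """Models the rule: inc(x) & inc(y) is replaced with inc(x + y)."""
    return x + y

def run_pending(counter, pending, fuse=True):
    """Execute the waiting inc arguments, optionally fusing them first."""
    if fuse and pending:
        total = pending[0]
        for n in pending[1:]:
            # The result of one fusion may be fused again.
            total = fuse_inc(total, n)
        pending = [total]
    for n in pending:
        counter.inc(n)
```

Running the waiting invocations inc(1), inc(2) and inc(3) with and without fusion leaves the same final value, but the fused run executes the exclusive method only once.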

3 The Amdahl Parallel Language 

Our language Amdahl is C++ extended by adding threads and exclusive meth- 
ods. It does not support inheritance. 

3.1 Threads and Exclusive Methods 

Amdahl has a primitive for thread creation. When the primitive is executed, a 
thread is created. 

Multiple exclusive methods cannot execute concurrently on the same object. 
Non-exclusive methods, on the other hand, execute concurrently with any other 
methods. Exclusive methods have the keyword sync at the head of their 
declaration. The order in which threads call an exclusive method on the object 
is independent of the one in which threads actually execute it on the object 
(i.e., the FIFO scheduling order is not guaranteed). The recursive acquisition 
of the lock by its owner is not allowed. 

¹ It may be possible to generate useful fusion rules automatically based on 
program analysis, but exploring this issue is not our concern here. 

3.2 Fusion Rules 

Amdahl programmers can define the behavior of method fusion by describing a 
set of fusion rules in class definitions. The syntax of fusion rules is shown below. 

$$\texttt{fusion}\;\; t_p\; p(t_{x_1}\, x_1, \ldots, t_{x_m}\, x_m) \;\&\; t_q\; q(t_{y_1}\, y_1, \ldots, t_{y_n}\, y_n)\;\; \{\; S \;\}$$

p and q are method names; p may be the same as q. $x_1, \ldots, x_m, y_1, 
\ldots, y_n$ are distinct variables. $t_{x_1}, \ldots, t_{x_m}, t_{y_1}, 
\ldots, t_{y_n}$ are the types of the variables $x_1, \ldots, x_m, y_1, 
\ldots, y_n$, respectively. $t_p$ and $t_q$ are the types of the return values 
of the methods p and q, respectively. S is a statement. In the following, S 
is called the body of a fusion rule. 

Let us explain the semantics of fusion rules. Assume that the above fusion rule 
is included in the definition of a class C. Furthermore, assume that two method 
invocations $p(t_{x_1}\, x_1, \ldots, t_{x_m}\, x_m)$ and 
$q(t_{y_1}\, y_1, \ldots, t_{y_n}\, y_n)$ have been invoked on the object O 
but both of them are waiting for the termination of another invocation 
executing on the object O. The fusion rule specifies that the execution 
of the two invocations can be replaced with the execution of the statement S. 
The details of the semantics are given below. 

- The statement S executes concurrently with any method invocations 
  executing on the object O. 

- The default receiver object of the method invocations in the statement S is 
  the object O. 

- An invocation of p and an invocation of q are fused irrespective of the order 
  in which they are invoked. That is, the semantics of a fusion rule remain the 
  same if the two invocation expressions in the rule are swapped. Consider a 
  class definition with the following fusion rule. 

      fusion void p(void) & void q(void) { ... } 

  Adding the following rule to it is of no benefit. 

      fusion void q(void) & void p(void) { ... } 

- The statement S is executed by either the thread that invoked p or the one 
  that invoked q. 

- Values are returned to the caller of p and that of q as follows. 

  Either $t_p$ or $t_q$ is the void type: The execution of S must be 
  terminated with a return statement. The value of the return statement is 
  returned to one of the callers of the fused method invocations whose 
  return value is not of the void type. 

  Neither $t_p$ nor $t_q$ is the void type: The execution of S must be 
  terminated with an mreturn statement. mreturn is a primitive for returning 
  two values. When S terminates with the statement 

      mreturn a and b, 

  a is returned to the caller of the method p and b is returned to the caller 
  of the method q. 

4 Sample Programs 

GUI Event Handling. The following program fuses multiple invocations of 
the method repaint to one invocation of the method. 

class Window { 
    fusion void repaint(void) & void repaint(void) { 
        repaint(); 
    } 
} 



Concurrent Buffers. The class Buffer makes an object that represents a 
buffer with an array. The class has the methods put and get. The code for 
checking buffer overflow and underflow is omitted. The following fusion rule 
allows us to “bypass” the manipulation of the array in the execution of a com- 
bination of put and get (we assume that buffers of the class Buffer do not give 
the users any guarantee of the order in which buffer elements are managed and 
further assume that the buffer overflow or underflow occurring when method 
fusion is absent does not need to be preserved when method fusion is used). 

class Buffer { 
    int length; 
    obj* elements[MAXBUFFERLEN]; 

    sync void put(obj* o) { elements[length++] = o; } 
    sync obj* get(void) { return elements[--length]; } 
    fusion void put(obj* o) & obj* get(void) { return o; } 
} 
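
To make the “bypass” concrete, the following sketch contrasts an unfused put/get pair with the body of the fusion rule. It is a sequential Python illustration, not Amdahl code; the array_ops counter and the function fused_put_get are assumptions added only to make the saved array manipulations visible.

```python
class Buffer:
    def __init__(self):
        self.elements = []
        self.array_ops = 0           # counts manipulations of the array

    def put(self, o):
        self.elements.append(o)
        self.array_ops += 1

    def get(self):
        self.array_ops += 1
        return self.elements.pop()

def fused_put_get(buffer, o):
    """Models the body of: fusion void put(obj* o) & obj* get(void) { return o; }
    The element goes straight to the caller of get; the array is untouched."""
    return o
```

As in the text, this relies on the assumption that the buffer gives no guarantee about the order in which elements are handed out.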

5 Discussion 

The primary purpose of method fusion is performance improvement. Though 
fusion rules can change the behavior of a program, we believe they should not. 
They should be performance hints. Otherwise a program will become error-prone 
and much less readable. 

Fusion rules that keep the behavior of a program are called transparent fusion 
rules. To put it differently, programmers cannot know whether the fusion defined 
by a transparent fusion rule actually occurred. For example, the fusion rule in 
the class Counter shown in Sect. 2 is transparent. On the other hand, the fusion 
rules that make the program show the behavior that has never been observed in 
any execution of the original program are not transparent. 

The definition of transparency depends on the definition of the behavior of 
a program and the definition of equivalence between behaviors. Consider the 
GUI code shown in Sect. 4. The fusion rule for the repaint method may reduce 
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flickers in a GUI window. If the extent to which a window flickers is included 
in the behavior of the program, the fusion rule is not transparent because it 
varies the behavior. Otherwise it is transparent. Currently, we do not give a 
strict definition to the term “behavior.” We would like to give one in the future. 

It is one of our long-term goals to make the compiler accept as many trans- 
parent fusion rules as possible and reject as many non-transparent fusion rules 
as possible. 

6 Implementation 

This section first describes a method execution algorithm that does not support 
method fusion. We then extend the algorithm to support method fusion. 

6.1 Basic Implementation Scheme 

An object lock is associated with each object. Multiple invocations of exclusive 
methods to the same object are serialized using the object lock. Before a thread 
executes an exclusive method on an object, it acquires the object lock associated 
with the object. After a thread completes executing an exclusive method on an 
object, it releases its object lock. Every method invocation is executed by the 
thread that invoked it. 

An object lock is represented by a flag, an auxiliary lock, and a doubly-linked 
queue of waiting tasks: 

Flag A flag has either of two states, FREE or LOCKED. When an exclusive 
method is executing on the object, the flag of the object is set to LOCKED. 
Otherwise it is set to FREE. 

Waiting tasks A waiting task is a data structure that represents a serialized 
method invocation (an invocation that has already been invoked but has 
not been executed due to the contention of the acquisition operations of the 
object lock). A waiting task contains a method ID and the arguments of the 
method. Below, it is also called task. 

A queue of tasks is called a waiting queue below. Flags and waiting queues are 
manipulated exclusively with auxiliary low-level locks such as spin locks. 

A thread acquires an object lock as follows. First it reads the flag of the 
object lock. If the flag is FREE, the thread changes the flag to LOCKED, which 
means the object lock is successfully acquired. If the flag is LOCKED, the thread 
creates a task for the method invocation and enqueues it into the tail of the 
waiting queue. An enqueued task contains a synchronization data structure, 
through which threads can communicate notification. Then the thread waits 
until it receives notification from the synchronization data structure. After it 
receives one, it executes the inserted task. 

A thread releases an object lock as follows. First the thread checks whether 
the waiting queue has a task. If it has no task, the thread changes the flag 
to FREE, which means the object lock is successfully released. Otherwise the 
thread dequeues a task out of the waiting queue and sends notification to the 
synchronization data structure associated with the task. 

class Counter { 
    int value; 
public: 
    sync int fa(int n) { /* fetch and add */ 
        int tmp = value; 
        value += n; 
        return tmp; 
    } 

    fusion int fa(int x) & int fa(int y) { 
        int z = fa(x + y); 
        /* fa(x) gets z; fa(y) gets the value it would see after fa(x) */ 
        mreturn z and z + x; 
    } 
} 

Fig. 1. A sample program. The code defines a counter that has the fetch-and-add 
operation. 

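
The acquire and release steps of Sect. 6.1 can be sketched as follows. This is a Python illustration under stated assumptions, not the Amdahl runtime: threading.Lock stands in for the auxiliary low-level lock, threading.Event for the synchronization data structure of a task, and the release step hands the lock to the task at the head of the queue, although the text only says that "a task" is dequeued.

```python
import threading
from collections import deque

FREE, LOCKED = "FREE", "LOCKED"

class ObjectLock:
    def __init__(self):
        self.flag = FREE
        self.aux = threading.Lock()      # auxiliary low-level lock
        self.waiting = deque()           # waiting queue of tasks

    def acquire(self, method_id=None, args=()):
        with self.aux:
            if self.flag == FREE:
                self.flag = LOCKED       # lock acquired immediately
                return
            # Contention: create a task and enqueue it at the tail.
            task = {"method": method_id, "args": args,
                    "notify": threading.Event()}
            self.waiting.append(task)
        # Wait for notification; the flag stays LOCKED during the hand-over.
        task["notify"].wait()

    def release(self):
        with self.aux:
            if not self.waiting:
                self.flag = FREE         # nobody is waiting
                return
            task = self.waiting.popleft()
        task["notify"].set()             # pass the lock to the waiting thread
```

A thread brackets the body of an exclusive method with acquire and release, so every invocation is still executed by the thread that invoked it.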



6.2 Implementation Scheme Extended to Support Method Fusion 

Serialized invocations to an object are fused by the thread that tries to enqueue 
a task into the waiting queue of the object. Just before the enqueuing, the thread 
checks whether it can fuse the task to be enqueued (called the enqueued task) and 
the task at the tail of the waiting queue (called the tail task). Tasks in the queue 
other than the tail task are not checked. If it can fuse them, the thread dequeues 
the tail task out of the waiting queue, and then it reads the information stored 
in the two tasks and executes the body of the fusion rule applied. Otherwise, the 
tail task is not dequeued out of the waiting queue; the enqueued task is actually 
enqueued. 

What can be fused is a combination of the tail task and the enqueued task. 
The tasks in the waiting queue except the tail task are not fused. There exists 
a performance trade-off among the strategies for choosing the tasks to be fused. 
One possible strategy is to check all of the tasks in the waiting queue. Although 
this extreme strategy is likely to increase the number of fusions, it also 
increases the cost of checking tasks. It seems almost impossible to devise a 
strategy that works effectively in all kinds of programs. 
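
The tail-only strategy can be illustrated with a sequential simulation of a fetch-and-add counter. The sketch below is written in Python under simplifying assumptions: all invocations arrive while the object is locked, a task is a dictionary holding the argument and a callback that delivers return values, and the fused task simply replaces the tail task. The names make_task, fuse, enqueue and run are hypothetical.

```python
from collections import deque

class FetchAddCounter:
    def __init__(self, value=0):
        self.value = value
        self.executions = 0

    def fa(self, n):                     # fetch and add
        tmp = self.value
        self.value += n
        self.executions += 1
        return tmp

def make_task(n, results, key):
    def deliver(z):
        results[key] = z                 # value returned to the original caller
    return {"n": n, "deliver": deliver}

def fuse(p, q):
    """Fuse fa(x) and fa(y) into fa(x + y); callers get z and z + x."""
    x, y = p["n"], q["n"]
    def deliver(z):
        p["deliver"](z)
        q["deliver"](z + x)
    return {"n": x + y, "deliver": deliver}

def enqueue(queue, task, fusion):
    # Only the tail task is checked; other waiting tasks are left alone.
    if fusion and queue:
        task = fuse(queue.pop(), task)
    queue.append(task)

def run(args, fusion):
    counter, results, queue = FetchAddCounter(100), {}, deque()
    for i, n in enumerate(args):
        enqueue(queue, make_task(n, results, i), fusion)
    while queue:                         # the lock holder has finished
        t = queue.popleft()
        t["deliver"](counter.fa(t["n"]))
    return counter, results
```

With arguments 1, 2 and 3 both runs return the same values to the callers, while the fused run executes fa only once.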

We explain the implementation of method fusion in more detail using the 
program in Fig. 1. Figure 2 shows how tasks are fused in the program². Captions 
for the subfigures in Fig. 2 are given below. 

² Although each thread has its own stack in Fig. 2, this is not essential. Our 
implementation scheme can also be applied to systems in which different threads 
reside in the same stack (e.g., Cilk [6], StackThreads/MP [19], and Schematic 
[14,15,17]). 

[Figure: the stacks of thread X and thread Y with the invocations fa(x) and fa(y)] 

Fig. 2. How two method invocations are fused. 


1. Thread Z is executing an exclusive method on the object. Thread Y en- 
queued a task that represents the invocation fa(y). Thread Y is waiting for 
notification through the data structure associated with the task. Thread X 
is trying to invoke fa(x) to the object. 

2. Thread X checks whether it can fuse the invocation fa(x) and the invocation 
fa(y). Since they can be fused, the thread dequeues the tail task out of the 
waiting queue. 

3. A frame S is pushed onto thread X’s stack. In the frame S, thread X executes 
the body of the fusion rule. A frame T is pushed onto thread Y’s stack. In the 
frame T, thread Y waits for a value to be sent by thread X and returns the 
value to the parent frame. A new synchronization data structure is created 
for communicating the value between thread X and thread Y. Thread X 
invokes fa(x+y), fails to acquire the lock of the object, and consequently 
enqueues a task that represents fa(x+y) into the waiting queue. 

4. Now thread X is executing the invocation fa(x+y). 

5. Thread X passes the return value of fa(x+y) to the frame S. 

6. Thread X executes the mreturn statement. One of the values in the mreturn 
statement is returned to the parent frame of thread X. The other is sent to 
thread Y via the synchronization data structure. Thread Y receives the sent 
value and returns it to the parent frame (as if it itself executed fa(y)). When 
the thread fuses invocations of void-type methods, the threads communicate 
a dummy value in the same way. 

In our implementation, each thread allocates tasks and synchronization data 
structures from the area pre-allocated at thread creation time. Areas for tasks 
and synchronization data structures do not have to be allocated dynamically 
with malloc or new because at most one task and one synchronization data 
structure is required for each thread at a time. 

Our compiler does not support static fusions of multiple invocations that are 
statically determined to be called in succession. 



7 Experimental Results 



We have tested the usefulness of method fusion through experiments. We have 
incorporated the method fusion mechanism into a prototype version of the 
Amdahl compiler. The machine used in the experiments is a Sun Enterprise 10000 
(UltraSPARC 250 MHz x 64, Solaris 2.7). 

We use the following benchmarks. 

Counter Each thread repeatedly invokes an exclusive method inc, which in- 
crements a counter value, to the counter object shared among threads. A 
fusion rule in the program specifies that multiple invocations of the method 
inc can be fused. 

FileWriter This program creates a file object, which keeps a file descriptor and 
is shared among threads. The object has an exclusive method strwrite, 
which writes the string given in its argument to the file represented by the 
file descriptor. The method flushes the buffered data in every invocation. 
Each thread repeatedly invokes the method strwrite on the object. A fusion 
rule in the program specifies that multiple invocations of the method 
strwrite can be fused. The runtime system will combine the two strings in 
the arguments of the invocations into a new string and invoke strwrite only 
once with the new string. 

FileReader This program creates a disk object, which encapsulates a physical 
disk and is shared among threads. The object has an exclusive method 
fileget, which opens a file whose path is given in one argument, reads the 
file, and stores the content of the file into the character array given in the 
other argument. Each thread repeatedly invokes the method fileget on the 
object. A fusion rule in the program specifies that multiple invocations of the 
method fileget can be fused if they have the same path in their arguments. 
In the program, all invocations of the method fileget have the same path 
in their arguments. 

ImageViewer This program reads an image file and displays it in a newly 
created window pixel by pixel. The program uses the GUI toolkit GTK+ [8]. 
At the beginning of the program, pixels in the image are partitioned among 
threads. Each thread repeatedly draws a new pixel on the window and repaints 
the row that contains the pixel. A thread can repaint a row by invoking 
an exclusive method repaint on the object that represents the window³. A 
fusion rule in the program specifies that multiple invocations of the method 
repaint can be fused. It fuses multiple invocations of repaint into another 
invocation of repaint that repaints the area covering the rows that would 
otherwise have been repainted by the multiple invocations. 

The benchmarks create a fixed number of threads at the beginning of the pro- 
gram. The number is the same as the number of processors. No thread is created 
during the succeeding execution. Since the amount of parallelism exposed in a 
program is the same as the number of processors, a thread that cannot immedi- 
ately execute an exclusive method waits in a busy-wait loop without switching 
to other computing tasks. In the following description of the experiments, the 
term “thread” has the same meaning as the term “processor.” 

In all the benchmarks, the number of invocations of exclusive methods is 
always the same and independent of the number of threads. 

We use at most 50 processors because interference from other processes is 
often observed when using a larger number of processors. In the experiments 
running ImageViewer, we used a 14-processor Sun Enterprise because the GUI 
libraries required to run ImageViewer are absent on the 64-processor machine. 

Figure 3 shows the execution times of the benchmarks. We compare four 
strategies for implementing exclusive methods: spin, mutex, custom, and fusion. 
Spin represents the programs using spin locks. Mutex indicates the programs 
using mutex locks provided by the operating system. Custom represents the 
programs that use the lock described in Sect. 6 and that do not support method 
fusion. Fusion represents the programs that use the lock described in Sect. 6 and 
that support method fusion. 

Method fusion is effective in all the benchmarks except Counter: the 
programs supporting method fusion show the best performance in FileWriter, 
FileReader, and ImageViewer. In Counter, no single strategy shows the best 
performance for every number of processors. In FileWriter, as the number of 
processors increases, the execution time of fusion drops and then grows 
again. The method fileget in FileReader includes heavy operations such as 
opening a file, and hence the performance improvement ratio in FileReader is 
larger than that in any other benchmark. In the execution of ImageViewer, 
fusion shows the best performance when the number of processors is less than 
eight. In ImageViewer, not only fusion but also custom becomes faster as the 
number of processors increases. We are investigating why custom becomes faster. 

Generally, in programs that have an object on which multiple invocations 
of exclusive methods are frequently serialized, the execution time increases ac- 
cording to the increase in the number of processors [13,16]. This phenomenon is 
observed in most of the programs. 

Figure 4 shows the number of executions of exclusive methods. No fusion 
indicates the programs that do not support method fusion. Fusion indicates the 
programs that support method fusion. In Counter, there is a small reduction in 
the number of executions of exclusive methods. In FileWriter, as the number 
of processors increases, the number of executions of exclusive methods decreases 
steadily. This result means that the gradual increase in the execution time of 
fusion in FileWriter is not due to the decrease in the number of fusions, but is 
probably due to the increase in overhead. In FileReader and ImageViewer, 
the shape of the curve that represents the execution time is very similar to that 
representing the number of executions of exclusive methods. 

Fig. 3. Execution times. 

Footnote: Since GTK+ is not thread-safe, GTK+ library functions must be called from within 
an exclusive method. GTK+ provides a mechanism that makes GTK+ thread-safe. 
However, the mechanism simply serializes invocations to GTK+ library functions. 

We report the amount of time needed for executing the exclusive method 
once. A rough approximation of the amount is acquired by dividing the overall 
execution time measured on one processor by the number of executions of exclu- 
sive methods. The amount for each of the benchmarks above is 0.25, 17, 1035, 
and 137 microseconds, respectively. 

8 Related Work 

Parallel Execution of Associative Operations. There is an extensive lit- 
erature on the techniques that extract parallelism among associative exclusive 
operations [3,5,9,12,18]. In systems using the techniques, each thread executes 
associative exclusive operations in parallel and accumulates the contributions of 
the operations in a thread-local area. The contributions of each thread are put 
together eventually. The techniques can only be applied to regular programming 
models in which it is obvious at what point the contributions of operations 
should be put together. Method fusion can be applied to irregular execution 
models in which finding that kind of point is difficult. Another problem with 
these techniques is that they do not provide a way to change executed methods 
with a modest amount of code. The existing techniques above and method fusion 
are complementary; method fusion does not obviate their use, and vice versa. 
The techniques above are useful in some programs, while method fusion is useful 
in others. 



Parallel Execution of “Non-interfering” Exclusive Methods. In sev- 
eral languages threads execute combinations of exclusive methods in parallel if 
the methods do not “interfere” with one another, and hence the semantics of 
the program are preserved even if concurrent invocations of them are not serialized 
[4,14,15,17,20,22]. Though these techniques are effective when the methods 
invoked in parallel do not interfere with one another, they are useless otherwise. 
On the other hand, method fusion works effectively when the methods invoked 
in parallel interfere with one another. Their techniques and method fusion are 
complementary. 



Network Combining. The Network Combining technique was proposed 
by Gottlieb et al. in their work for the NYU Ultracomputer [7]. In this technique, 
a network switch in a parallel computer combines multiple instructions flowing 
through the network into one instruction: only the combined instruction is sent 
from the switch. For example, two fetch-and-add instructions are combined into 
one fetch-and-add instruction. Their technique combines multiple instructions, 
keeping the semantics of sequential execution. Method fusion, on the other hand, 
combines multiple method invocations, currently making programmers responsi- 
ble for keeping the semantics of their program. 
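
The fetch-and-add example can be made concrete with a small sketch. It is only an illustration of the combining idea under simplifying assumptions (a single memory cell, two requests, no real network); the names FetchAdd, combine, decombine, and Memory are hypothetical and do not come from the Ultracomputer design or from this paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class FetchAdd:
    """A fetch-and-add request: return the old value, then add `inc`."""
    inc: int


class Memory:
    """A single shared memory cell (a stand-in for the memory module)."""

    def __init__(self, value: int = 0) -> None:
        self.value = value

    def fetch_add(self, req: FetchAdd) -> int:
        old = self.value
        self.value += req.inc
        return old


def combine(first: FetchAdd, second: FetchAdd) -> FetchAdd:
    """What the switch forwards: one request carrying the summed increment."""
    return FetchAdd(first.inc + second.inc)


def decombine(first: FetchAdd, old_value: int) -> Tuple[int, int]:
    """Split the single reply into the two replies the requesters expect.

    The replies equal those of executing `first` and then `second` serially.
    """
    return old_value, old_value + first.inc


def run_combined(mem: Memory, a: FetchAdd, b: FetchAdd) -> Tuple[int, int]:
    """Send one combined request to memory and fan the reply back out."""
    reply = mem.fetch_add(combine(a, b))
    return decombine(a, reply)
```

Only one request reaches the memory cell, yet each requester observes a value consistent with some sequential order, which is the property the paragraph above attributes to network combining.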



Static Fusion of Multiple Operations through Program Transforma- 
tions. There is much literature on program transformations that fuse multiple 
operations statically [1,2,5,10,11,21]. The techniques described in the literature 
detect static occurrences of the operations that are invoked consecutively, and 
transform them into a cheaper operation statically. Our technique, on the other 
hand, detects dynamic occurrences of the operations that can be executed consecutively, 
and accordingly dispatches control dynamically to a cheaper 
operation. Static fusion, unlike method fusion, does not incur runtime 
overhead. Therefore static fusion is preferred wherever it can be used. Method 
fusion, on the other hand, can be applied to programs in which static fusion 
cannot be applied. 



Our Previous Work. Our previous work [13,16] shows a scheme for efficiently 
executing parallel programs in which multiple invocations of exclusive methods 
are serialized frequently. The scheme improves the locality of memory references 
and the performance of lock operations. Unlike method fusion, the scheme does 
not change a multiset of executed method invocations. The scheme and method 
fusion are complementary. 



9 Conclusion and Future Work 

We have described a language design for method fusion and shown an implementation 
scheme for it. The method fusion mechanism fuses multiple 
critical sections that successively appear in a dynamic control flow across the 
thread boundaries. To our knowledge, a mechanism that can do this has to 
date not been proposed. We have confirmed the effectiveness of method fusion 
in experiments. Method fusion significantly improved the performance of the 
programs in which exclusive methods perform heavy operations such as I/O 
operations. 

Method fusion is particularly effective in two kinds of programs. One is pro- 
grams in which executing an operation twice can be replaced with executing the 
operation once. An example is a GUI program that executes repaint operations. 
The other is programs in which a pair of operations “neutralize” each other. An 
example of a pair is the put operation and the get operation for buffer objects. 
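
The two kinds of rules named above can be sketched as follows. This is not the language or the implementation described in the paper: it is a single-threaded illustration in which a newly arriving invocation is fused only with the invocation at the tail of a pending queue, and the buffer is assumed to be an unordered bag so that handing a freshly put value straight to a getter is acceptable. The names fuse and enqueue are hypothetical.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple

Invocation = Tuple[str, Optional[object]]  # (method name, argument)


def fuse(tail: Invocation, new: Invocation) -> Optional[List[Invocation]]:
    """Return the replacement for the pair (tail, new), or None if no rule applies.

    [] means the pair neutralizes itself; [inv] means the pair becomes inv.
    """
    if tail[0] == "repaint" and new[0] == "repaint":
        return [tail]   # repainting twice in a row is replaced by repainting once
    if tail[0] == "put" and new[0] == "get":
        return []       # the put value is handed directly to the getter
    return None


def enqueue(pending: Deque[Invocation], new: Invocation) -> Optional[object]:
    """Add `new`, fusing it with the tail of the queue when a rule matches.

    Returns the value handed to a fused `get`, otherwise None.
    """
    if pending:
        fused = fuse(pending[-1], new)
        if fused is not None:
            tail = pending.pop()
            pending.extend(fused)
            return tail[1] if new[0] == "get" else None
    pending.append(new)
    return None
```

In this sketch, two consecutive repaint requests leave a single pending repaint, and a put followed by a get leaves nothing pending at all, so the serialized object executes fewer exclusive methods.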

We are actively exploring several future directions for method fusion. Firstly, 
we would like to develop a framework that helps programmers describe transparent 
fusion rules. Fusion rules described in real-world programs will be more 
complex than the ones given in this paper. Since static analysis techniques may 
not work well for complex fusion rules, it will be necessary to provide an environment 
that supports fusion rule debugging. Secondly, we would like to combine 
method fusion with inheritance. Thirdly, we would like to explore other imple- 
mentation strategies than just choosing the task on the tail of the queue. Finally, 
we would like to improve the current implementation so that multiple manipulations 
of a waiting queue may run in parallel. 

Abstract. The technology under consideration is called “Computational Proxy- 
Server” (CPS or “Computational Portal”) and is intended for remote 
computing. It simplifies access 
to high-performance computing resources via the Internet for both application 
clients and human clients. The technology is based on distributed client-server 
architectures, agent technologies, component models, and the CORBA and COM 
architectures. It is oriented towards allotting a primary service 
to the client as a pair <software, hardware>. The approach allows a pair 
<client - computing server> to be divided into two pairs: <client - CPS> and <CPS - 
computing server>. The technology is realized through a complex of software 
tools: CPS proper, HyperModeller, and πJ. 



1. Introduction 

Application of heterogeneous complexes of hardware and software tools, including 
distributed high-performance computing resources, could increase the efficiency of 
developing compound tasks. In order to use such a heterogeneous environment, it is 
necessary to solve a number of problems connected with convenient and effective 
computer-human interaction, with software interaction between remote hardware 
components, with the reliability of a distributed system whose components could hang, etc. 
The Internet is a natural environment for providing access to remote computing [1]. 
Recently, web technology and the CORBA and COM architectures have often been used 
for remote access to high-performance computing resources. Component technologies 
are now often used as well. The technology of proxy computers is widely used for various 
Internet protocols. Technologies of visual programming and graphical user interfaces are 
widespread in all areas of computer applications. 

We propose a technology that integrates three items: (1) “Computational 
Proxy-Server” (CPS or Computational Portal), a technology of access to remote 
computational services via the Internet [4]; (2) a technology of integrated visual 
programming, which is realized in an authoring tool [5]; and (3) a technology of open 
visual object-oriented assembling, which is implemented in the tool HyperModeller [6]. 
The technology is based on the component “π-technology” [2, 3], using functional, 
algorithmic, and object-oriented composition, and web and mail access. 

2. Basic Concept of the Technology 

The software realizing the technology consists of three main items: software for the client 
computer, software for the proxy computer (CPS), and software for the computing server 
(CS). The overall chart of the CPS technology components is shown in Fig. 1. 




Fig. 1. The integrated chart of the computing proxy-server technology 

CPS performs the following tasks: 
 - support of “a reference bureau” for functions provided to the client; 
 - acceptance of orders for computing, with data transfer (under the COM or CORBA protocols, via web access, or via email); 
 - deciding on the most expedient choice of a particular computing server for a particular computation; 
 - activation of computing on a computing server; 
 - data transfer; 
 - monitoring of the computing process and deciding whether to change the computing server (dynamic reconfiguration); 
 - accumulation of experience about the reliability of computing servers; 
 - filing of resources provided by computing servers; 
 - filing of programs for CS, to be stored in the CPS library. 

The main components of CPS are shown in Fig. 2. 
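
A few of these tasks (the reference bureau, the choice of a computing server, dynamic reconfiguration, and the accumulation of reliability experience) can be sketched together. The paper does not describe the CPS internals at this level, so everything below is a hypothetical illustration: the class names, the smoothed success ratio used as a reliability score, and the retry-on-failure policy are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ComputingServer:
    name: str
    run: Callable[[str, object], object]  # (function name, data) -> result
    successes: int = 0
    failures: int = 0

    def reliability(self) -> float:
        # Smoothed success ratio, so that an untried server scores 0.5.
        return (self.successes + 1) / (self.successes + self.failures + 2)


@dataclass
class ProxyServer:
    registry: Dict[str, List[ComputingServer]] = field(default_factory=dict)

    def register(self, function: str, server: ComputingServer) -> None:
        """File a resource provided by a computing server."""
        self.registry.setdefault(function, []).append(server)

    def functions(self) -> List[str]:
        """The 'reference bureau': list the functions offered to clients."""
        return sorted(self.registry)

    def order(self, function: str, data: object) -> object:
        """Run an order on the most reliable server, switching on failure."""
        candidates = sorted(self.registry.get(function, []),
                            key=lambda s: s.reliability(), reverse=True)
        for server in candidates:
            try:
                result = server.run(function, data)
            except Exception:
                server.failures += 1   # remember the failure, try the next one
                continue
            server.successes += 1
            return result
        raise RuntimeError(f"no computing server could run {function!r}")
```

A client only calls functions() and order(); which computing server actually ran the job, and whether a switch happened along the way, stays hidden behind the proxy.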

A client computer is connected with CPS via the Internet. It is equipped with 
appropriate software. The client can receive the list of provided resources, which are 
registered on CPS, can examine the state of tasks started before, and also can transmit 
an order for computing. 

The authoring tool [5] can be used as a client for CPS. It is a tool for 
programming in Java, which can be used as a stand-alone tool, as an add-on for MS 
FrontPage, and as a component of the HyperModeller environment. The software 
supports various chart representations of Java: π-chart, flow-chart, etc. Operators of a 
graphic program can be not only Java operators but also calls of computing 
services provided by CPS. The tool allows external languages to be used as sublanguages 
of a program (according to the π-technology). 

Fig. 2. The main components of CPS 

The HyperModeller tool can also be used as a client for CPS. It realizes an open 
object-oriented visual assembling technology. An application is assembled as a chart. 
The chart includes elements connected by links. The elements can be, 
in particular, computing services provided by CPS and modules created in πJ. 
Moreover, computational modules and input/output interface elements created in 
arbitrary languages (according to the π-technology) can be used as building bricks. 




Fig. 4. A screenshot of HyperModeller chart, which uses picture processing on high- 
performance computer via CPS 

An arbitrary mail client or web browser can also be used as a client for CPS (see 
Fig. 5). 

Fig. 6 illustrates possible relations between the tools’ components in a hypothetical 
project. An element created in one component of the tools can be used as a ‘brick’ 
for another component. For example, an object created in HyperModeller (as a chart) 
can be used as an element of a program in πJ. This program can in principle 
be used in turn as an element of a HyperModeller chart. Elements created in 
HyperModeller or in πJ and hosted at CPS can be used to implement an interface via a Web 
server. 



3. Multimedia Computational Proxy-Server 

It seems that one of the fruitful applications of CPS is computer graphics and image 
processing (such as video compression, generation of animated scenes, or rendering), 
because the performance of a user's computer is often lower than is necessary for an efficient 
solution of such tasks. CPS can support these tasks in two main modes: off-line and 
on-line (real-time mode). 

We call “Multimedia CPS” the application of Computational Proxy-Server 
technology for multimedia data stream processing in real time (Fig. 7). 

Fig. 5. Web-access to CPS computational service 




Fig. 6. The main components of the proposed client computer software and their interlinks. 

Compression of video and audio streams by CPS in real time allows CPS to be used 
for video connection of remote clients that have slow client computers and a slow 
Internet channel. 

As the first step of the multimedia project, an off-line audio-video codec 
for the MPEG-4 ISO standard has been developed (the codec is implemented in C++). The codec is registered as 
one of the current CPS resources. Development of a real-time MPEG-4 codec for CPS is in 
progress. The codec will use a high-performance computing cluster (Institute of 
Mathematics and Mechanics, Ural Branch RAS). 

Fig. 7. Multimedia computational proxy-server 



4. Conclusion 

Experimental versions of CPS are published and operated on the Internet 
(http://cps.imm.uran.ru). 

The author thanks Serge Izmailov, Dmitry Smirnov, Vadim Kosarev, Alexander 
Petukhov, Alexander Yakovlev, and Natalia Beresneva for participation in realization 
of the software. 

The project is carried out with financial support of the Russian Foundation for 
Basic Research (code 99-01-00468). 

Abstract. We present a new concurrent (constraint) logic programming 
language based on partially ordered event structures. A system is modeled 
as: (a) a set of concurrent processes, which are Prolog programs 
extended with event goals declaring program points of interest, and (b) 
a constraint store which imposes restrictions on the event goals' execution 
order. The constraint store acts as a coordination entity which on the 
one hand encapsulates the system synchronization requirements, and on 
the other hand provides a declarative specification of the system concurrency 
issues. This produces a powerful formalism which at the same 
time overcomes the deficiencies of traditional concurrent logic programming 
languages and preserves the benefits of declarative programming. 



1 Introduction 

The task of programming concurrent systems is substantially more difficult than 
programming for sequential machines. Unlike sequential (or transformational) 
programs, which merely terminate with a final result, concurrent (or reactive) 
programs produce results during their execution and may not even be expected 
to terminate. Thus, while in traditional sequential programming the problem is 
reduced to making sure that the program’s final result (if any) is correct and that 
the program terminates, in concurrent programming it is not necessary to obtain 
a final result but to ensure that several properties hold during program execution. 
These properties are classified into safety properties, those that must always be 
true, and progress (or liveness) properties, those that must eventually be true. 
Partial correctness and termination are special cases of these two properties. 
We believe that the intrinsic difficulties in writing concurrent systems can be 
considerably reduced by 

— using a declarative formalism to explicitly specify the system safety and 
progress properties, and 

— treating these properties as orthogonal to the system base functionality. 

This paper proposes a new concurrent (constraint) logic programming language 
which models a concurrent system as: (a) a set of concurrent processes, 
which are Prolog programs extended with event goals, and (b) a constraint store 
which imposes restrictions on the event goals execution order. The constraint 
store acts as a coordination entity which on the one hand encapsulates the system 
synchronization requirements, and on the other hand provides a declarative 
specification of the system concurrency issues. 

Event Logic Programming 
V. Malyshkin (Ed.): PaCT 2001, LNCS 2127, pp. 314-318, 2001. 
(c) Springer-Verlag Berlin Heidelberg 2001 

2 Related Work 

Logic programming languages derive from the procedural interpretation of a 
subset of the first order predicate calculus. They offer a unifying style which 
allows them to be considered at the same time as specification languages, as 
formalisms for proving program properties, as well as programming languages. 
The concurrent constraint logic family of programming languages (e.g., CCP [7], 
Parlog [3], Concurrent Prolog [8], Guarded Horn Clauses [10]) preserves many of 
the benefits of the abstract logic programming model, such as the logical reading 
of programs and the use of logical terms to represent data structures. However, 
although concurrent logic programming languages preserve many benefits of the 
logic programming model, and their programs explicitly specify their final result, 
important program properties, namely safety and progress properties, remain 
implicit. These properties have to be preserved by using control features such as 
modes and sequencing, producing programs with little or no declarative reading. 

It is also worth mentioning the language Linda [2]. Linda is related to the 
work reported here in the sense that it separates application functionality and 
concurrency control by providing a model for concurrency and communication 
via a shared tuple space. However, the tuple space has no logical reading on its 
own and it is up to the programmer to give meaning to the tuples on the tuple 
space. In general, this forces the specification of a system to be low level and 
makes any formal treatment for synthesizing and verifying specifications 
impossible. There has been work on using tuple spaces as a coordination mechanism 
in a logic programming framework [1,4,9], but these approaches inherit the lack 
of a logical reading of the tuple spaces. 

3 The Constraint Language 

Many researchers, e.g. [5,6], have proposed methods for reasoning about tempo- 
ral phenomena using partially ordered sets of events. Our approach to concurrent 
programming is based on the same general idea. The basic idea here is to use a 
constraint logic program to represent the (usually infinite) set of constraints of 
interest. The constraints themselves are of the form X < Y, read as “X precedes 
Y” or “the execution time of X is less than the execution time of Y”, where X 
and Y are events, and < is a partial order. 

R. Ramirez and A.E. Santosa 

The constraint logic program is defined as follows. Constants range over 
event classes E, F, ... and there is a distinguished (postfixed) functor +. Thus 
the terms of interest, apart from variables, are e, e+, e++, ..., f, f+, f++, .... 
The idea is that e represents the first event in the class E, e+ the next event, 
etc. Thus, for any event X, X+ is implicitly preceded by X, i.e. X < X+. 
We denote by $e_N$ the N-th event in the class E. Program facts or predicate 
constraints are of the form $p(t_1, \ldots, t_n)$, where p is a user-defined predicate and 
the $t_i$ are ground terms. Program rules or predicate definitions are of the form 
$p(X_1, \ldots, X_n) \leftarrow B$, where the $X_i$ are distinct variables and B is a rule body 
whose variables are in $\{X_1, \ldots, X_n\}$. A program is a finite collection of rules 
and is used to define a family of partial orders over events. Intuitively, this 
family is obtained by unfolding the rules with facts indefinitely, and collecting 
the (ground) precedence constraints of the form e < f. Multiple rules for a given 
predicate symbol give rise to different partial orders. For example, since the 
following program has only one rule for p: 

p(E,F) ← E < F, p(E+,F+). 

it defines just one partial order e < f, e+ < f+, e++ < f++, .... In contrast, 

p(E,F) ← E < F, p(E+,F+). 
p(E,F) ← F < E, p(E+,F+). 

defines a family of partial orders over {e, f, e+, f+, e++, f++, e+++, ...}. 
We will abbreviate the set of clauses $H \leftarrow Cs_1, \ldots, H \leftarrow Cs_n$ by the disjunction 
constraint $H \leftarrow Cs_1 ; \ldots ; Cs_n$, in which disjunction is specified by the disjunction 
operator ‘;’. 

The constraint logic programs have a procedural interpretation that allows 
a correct specification to be executed in the sense that events get executed only 
as permitted by the constraints represented by the program. This procedural 
interpretation is based on an incremental execution of the program and a lazy 
generation of the corresponding partial orders. Constraints are generated by the 
constraint logic program only when needed to reason about the execution times 
of current events. 
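
The lazy generation of constraints can be illustrated with a generator. The sketch below is not the authors' implementation (which is written in Parlog); it only unfolds one recursive rule of the form p(E,F) ← E < F, p(E+,F+), starting from a fact p(e,f), and produces each precedence constraint on demand. The representation of an event as a pair (class name, number of + applications) and the helper names are assumptions made for the example.

```python
from itertools import islice
from typing import Iterator, Tuple

Event = Tuple[str, int]            # (event class, number of '+' applications)
Constraint = Tuple[Event, Event]   # (earlier, later), i.e. earlier < later


def succ(event: Event) -> Event:
    """Apply the postfix functor + to an event."""
    return (event[0], event[1] + 1)


def show(event: Event) -> str:
    """Render an event term, e.g. ('e', 2) as 'e++'."""
    return event[0] + "+" * event[1]


def unfold_p(e: Event, f: Event) -> Iterator[Constraint]:
    """Lazily unfold  p(E,F) <- E < F, p(E+,F+)  starting from p(e,f)."""
    while True:
        yield (e, f)               # the ground constraint E < F
        e, f = succ(e), succ(f)    # the recursive call p(E+,F+)


# Only the constraints that are actually requested get generated.
first_three = [f"{show(a)} < {show(b)}"
               for a, b in islice(unfold_p(("e", 0), ("f", 0)), 3)]
```

Because the generator is infinite, a consumer takes only as many constraints as it needs to reason about the events currently being executed, which mirrors the incremental execution described above.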

4 Event Goals 

In order to refer to the visit times at points of interest in the program we 
introduce event goals. An event goal has the syntax of an event name enclosed 
by angle brackets, e.g. <e>. Event goals appear in program clauses bodies. 

Given a pair of events, constraints can be stated to specify the relative order 
of execution of event goals in all parts of the program. If the execution of a 
process $P_1$ reaches an event goal containing an event whose execution time is 
constrained to be greater than the execution time of a not-yet-executed event of 
an event goal in a different process $P_2$, then process $P_1$ is forced to suspend. In 
the presence of recursive definitions in a logic program, event goals are typically 
executed several times. Thus, in general, an event e in an event goal G represents 
an event class, where each of its instances e, e+, e++, ... corresponds to an 
execution of G (e represents the first execution, e+ the second, etc.). 

When <e> is evaluated, the constraint store is checked to determine whether event 
e is enabled, i.e., whether it is not preceded by another not-yet-executed event; then 
the constraint store is updated when necessary. The constraint store is updated 
when e is enabled by deleting all primitive constraints e < E. The evaluation of 
event goals is done atomically to preserve the consistency of the constraint store. 
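
The evaluation of an event goal can be sketched as follows. This is a minimal illustration under stated assumptions, not the prototype described in the paper: the store holds a fixed, finite set of ground constraints, events are plain strings, a lock stands in for atomic evaluation, and a blocked process is modelled by a False return value instead of real suspension. The class and method names are hypothetical.

```python
import threading
from typing import Iterable, List, Set, Tuple


class ConstraintStore:
    """Holds primitive constraints (a, b), each meaning a < b."""

    def __init__(self, constraints: Iterable[Tuple[str, str]]) -> None:
        self._constraints: Set[Tuple[str, str]] = set(constraints)
        self.executed: List[str] = []
        self._lock = threading.Lock()   # makes each evaluation atomic

    def try_event(self, e: str) -> bool:
        """Evaluate the event goal <e>; return True if e was executed."""
        with self._lock:
            # e is enabled only if no remaining constraint puts an event before it.
            if any(later == e for _, later in self._constraints):
                return False            # the calling process would suspend here
            self.executed.append(e)
            # e has now happened, so delete all primitive constraints e < X.
            self._constraints = {(a, b) for a, b in self._constraints if a != e}
            return True
```

With the single constraint a < b in the store, an attempt to execute b first is refused; once a has been executed and its constraints deleted, b becomes enabled.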

We have adopted a design with explicit concurrency in our language. 
In traditional concurrent (constraint) logic programming languages, parallelism is 
implicit: every literal in a clause body represents a concurrent process. We believe 
that this can cause the creation of an unnecessary number of concurrent processes 
and thus that the creation of concurrent processes should preferably be done by an 
explicit primitive in the language. We provide the || operator, which allows the 
programmer to explicitly spawn concurrent processes and which is similar to the &/2 
operator of &-Prolog [4]. 

5 Conclusion 

In this paper we have presented a new concurrent (constraint) logic programming 
language based on temporal constraints among a set of events. In the language, 
concurrent processes are specified by Prolog programs extended with event goals. 
A constraint store contains a set of temporal constraints which impose restric- 
tions on the event goals execution order. The constraint store acts as a coordi- 
nation entity which on the one hand encapsulates the system synchronization 
requirements, and on the other hand, provides a declarative specification of the 
system concurrency issues. This produces a powerful formalism which at the 
same time overcomes the deficiencies of traditional concurrent logic programming 
languages and preserves the benefits of declarative programming, providing 
great advantages in writing concurrent programs and manipulating them while 
preserving correctness. 

Current status: We have a prototype implementation of our language written 
in Parlog. Parlog was chosen as the implementation language since it provides 
mechanisms for creating concurrent processes and for process synchronization. 
The implementation is reported in an accompanying paper. 

Abstract. A time cost model for the performance of CORBA distributed applications 
is proposed, and new insights on enhancing parallelism and on 
performance optimisation techniques for distributed programs are given. 
Some results of measuring the performance of experimental applications are 
presented. A new 4-tiered architecture for Internet distributed applications, 
based on our cost model, is proposed as an alternative to the traditional 3-tiered one. 



1 Introduction 

The rapid development of computer networks, the introduction of Intranet technologies, 
and the universal spread of Internet services have caused essential shifts in the basic paradigms 
of software system design, which can be characterised by ever-growing 
demands for performance and ease of programming of distributed applications. 
These fundamental changes have led to the development of whole classes of software 
architectures and middleware, and CORBA (Common Object Request Broker Architecture) 
is one of the most distinguished among them. 

The CORBA standard determines the architecture of distributed objects and the 
interaction between them in heterogeneous networks [1]. CORBA gives a way of 
organising distributed computation that has a number of properties attractive 
to a designer, such as a precise object model, separation of an object's description 
from its implementation, and call transparency. A number 
of works investigate performance issues in CORBA and propose methods to 
improve program efficiency [2-5]. However, to our knowledge they deal insufficiently 
with the development of quantitative performance models and programming 
methods in CORBA application design. 

The purpose of this paper is to give a model and methodology for building 
high performance distributed CORBA applications suitable for practical use in 
industrial-strength software. The work relies on our experience of research and 
development of a CORBA based enterprise distributed software system [6]. We 
give new insight into methods of enhancing parallelism and performance optimisation 
of distributed programs and illustrate the presentation with experimental 
performance measurements of sample applications. The paper concludes with a proposal 
of a new 4-tiered architecture for Internet-based distributed applications, 
which enriches the traditional 3-tiered one with an extra logic component aimed 
at enhancing system performance by means of various methods of minimising 
object interaction time. 

2 CORBA Objects Interaction and a Time Cost Model 

General prerequisites of CORBA object request broker interactions are specified 
by the OMA architecture [7], as shown in the figure below, where GIOP (General 
Inter-ORB Protocol) is a protocol for data transfer between brokers. This architecture 
imposes the following limitations on client-server interactions: 

— interaction establishes permanent connection between the server and client 
during request processing; multiple requests can be multiplexed over the 
same connection. 

— method invocation is synchronous, that is once client’s thread has executed 
remote method invocation, it is blocked until the reply arrives; asynchronous computation 
can be built on another (parallel) thread communicating 
with a server thread by means of synchronous method invocation. Note 
that in this article we do not discuss the new CORBA AMI feature, which 
provides an asynchronous method invocation interface; this is a subject for another 
analysis. 

— implementation of the remote method may require sending some context 
needed for correct execution of the method, for example identifier of current 
transaction or information about codeset, used in current session; 

— stages of request processing and their order are predefined. 

Our analysis of the time costs of request brokers shows that the performance of a 
broker depends mainly on the following functions, which underlie the stages of request 
processing. 

1. Marshalling (demarshalling) function $M(x)$ ($DM(x)$), which implements the 
coding (decoding) stage in processing request x. These functions are almost 
additive in space: $M(x|y) = M(x)|M(y)|pad(x,y)$, where $x|y$ is concatenation and 
$pad(x,y)$ is the number of bytes aligning y after x. They are also almost linear in 
time: $T_M(x|y) = T_M(x) + T_M(y) + \delta(x,y)$, where $T_M(x)$ is the time for coding x and 
$\delta(x,y)$ is negligible with respect to $T_M(x)$. Notice that the size of the corresponding 
GIOP sequence in bytes, $|M(x)| = K_{SM}|x|$, can be considered proportional 
to the size of the request $|x|$ with a coefficient $K_{SM}$. The times for coding and decoding are 
considered nearly equal. The coefficient $K_{SM}$ does not depend on the ORB and is fully 
defined by the type of request and the coding used (usually GIOP). $T_M(x)$ depends 
on the quality of the ORB marshalling algorithm. 

2. Object search function. The main parameter on which this function depends 
is the number of objects supported in the system. So the time cost of 
invoking this function can be designated $T_{FO}(o, N_O)$, where $N_O$ is the size of the 
object table in the system. 

3. Method search function. It searches the table of methods, with a time cost 
designated as $T_{FM}(m, N_M(o))$, where m is the method and $N_M(o)$ is the size of the table 
of remote methods of the object. 

4. Object activation and method invocation function, estimated in time as 
$T_I(o,m)$, which includes the time for servant invocation and, if needed, its thread 
initialisation. 

5. Network data transfer function, where $K_S$ is the mean transfer time per byte. 

Consider the program code $y = o.m(x)$ with a propagated context c, 
where m is a method of the remote object o with input parameter x and output 
y. To estimate the time cost of this basic piece of code of distributed 
applications, we need to define a number of constituents of the time cost model, 
which are characterised in terms of the functions introduced above: 

— if $req(o,m,x,c)$ denotes the function that sends the appropriate request for the 
$o.m(x)$ operation, then the request coding time can be estimated as: 

$$T_M(req(o,m,x,c)) \approx T_M(o|m|x|c) \approx K_M \times (|o| + |m| + |x| + |c|)$$ 

— request transfer time: 

$$T_S(req(o,m,x,c)) \approx K_S \times (|o| + |m| + |x| + |c|)$$ 

— request decoding time: 

$$T_{DM}(o,m,x,c) \approx K_M \times (|o| + |m| + |x| + |c|)$$ 

— time for searching for the object in the active object map, object activation, method 
invocation, and evaluation of the request: 

$$T_{FO}(o, N_O) + T_{FM}(m, N_M(o)) + T_I(o,m)$$ 

— reply transfer time: 

$$T_S(reply(y)) \approx T_S(y) \approx K_S \times (|y| + |c|)$$ 

— reply decoding time: 

$$T_{DM}(y,c) \approx K_M \times (|y| + |c|)$$ 

Summarizing these time estimations, we can deduce the following timing cost 
model of CORBA remote method invocation: 

$$T = K_1 \times (|o| + |m| + |x| + |y|) + 2K_1 \times |c| + T_{FO}(o, N_O) + T_{FM}(m, N_M(o)) + T_I(o, m)$$
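
As a rough illustration, the summed model can be evaluated numerically. The sketch below is written in Python and treats the lookup and invocation terms $T_{FO}$, $T_{FM}$ and $T_I$ as fixed numbers, which is a simplifying assumption. The class name `InvocationCostModel` and all parameter values are hypothetical and serve only to show how the terms combine.

```python
from dataclasses import dataclass


@dataclass
class InvocationCostModel:
    """Evaluates T = K1*(|o|+|m|+|x|+|y|) + 2*K1*|c| + T_FO + T_FM + T_I."""
    k1: float    # combined per-byte coefficient K1 (time per byte)
    t_fo: float  # object search time T_FO, assumed constant here
    t_fm: float  # method search time T_FM, assumed constant here
    t_i: float   # activation and invocation time T_I, assumed constant here

    def remote_call(self, o: int, m: int, x: int, y: int, c: int) -> float:
        """Sizes in bytes: object key, method name, input, output, context."""
        payload = self.k1 * (o + m + x + y)
        context = 2 * self.k1 * c  # the context travels with request and reply
        return payload + context + self.t_fo + self.t_fm + self.t_i


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
model = InvocationCostModel(k1=1.0, t_fo=2.0, t_fm=3.0, t_i=4.0)
example_cost = model.remote_call(o=1, m=1, x=1, y=1, c=1)
```

With these values the payload contributes 4, the context 2 and the fixed terms 9 time units.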

3 Enhancing Performance Techniques 
for Distributed Applications 

It is difficult, if not impossible, to give a uniform definition of efficiency
that suits all classes of applications. In each case of interest it can be a
particular set of criteria. In this paper we consider timing characteristics as
crucial for the performance of most distributed applications. Among them are:
computing performance, i.e. the speed of operation execution, estimated as the
total time spent by the processor on each step of the calculation performed;
application reactivity (responsiveness), i.e. the time interval between the
user's input of data and the appearance of new information in the client
application; and efficiency, i.e. the degree of processor time utilisation,
expressed as the share of the actual calculation time of a task in the total
residence time of the task in the system.

Below we briefly present a few examples of techniques that improve performance
characteristics, followed by an application example illustrating their effect.
Basically, the methods concern the optimisation of service interface design.

1. Using composite operations. Due to CORBA location transparency, remote and
local method invocations look identical from the application programmer's point
of view. But their actual time costs differ: in the local case the cost is
$T_I$ (time of invocation), while in the remote case it can be dominated by
$T_S$ (time of network data transfer). To reduce the overhead, multiple
nonlocal method invocations are aggregated into a composite one, as shown in
the following code fragment. Suppose we have two subsequent invocations on a
remote object:

    y1 = o.req1(x1);
    y2 = o.req2(x2);

Define an operation req12(x1, x2) of object o as the composition of req1 and
req2. Then, using our model, the time cost of two sequential remote invocations
is:

$$K_1 (2|o| + |req1| + |req2| + |x_1| + |x_2| + |y_1| + |y_2|) + 2 \cdot 2K_1 |c| + 2T_{FO}(N_O) + 2T_{FM}(req, N_M(o)) + T_I(o, req1) + T_I(o, req2)$$

while in the case of the composite operation we have:

$$K_1 (|o| + |x_1| + |x_2| + |y_1| + |y_2| + |req12|) + 2K_1 |c| + T_{FO}(N_O) + T_{FM}(N_M + 1) + T_I(o, req12).$$

So by using composite operations we can save a time of

$$K_1 |o| + 2K_1 |c| + T_{FO}(N_O) + T_{FM}(req12, N_M + 1) + \Delta T_I(o, (req12, req1 + req2)),$$

which is equal to the cost of an empty remote invocation void f(void). This
transformation can improve all kinds of timing characteristics of performance.
Naturally, the effect of a composite operation is more significant when
multiple invocations are coupled, as in the branch operation
if (o.m1(x)) o.m2(x); else o.m3(x);. This optimisation is most effective for
loop constructs like:

    for (ULong i = 0; i < x.length(); ++i) r[i] = o.m(x[i]);

In such a case of homogeneous data arrays, multiple operations like
$m : O \times X \to Y$ can be defined as a single composite operation
$m_{seq} : O \times X^{*} \to Y^{*}$, with the obvious advantage of saving
execution time. Examples of this technique can be found in the CORBA Collection
Services [8]. As our model shows, this transformation is applicable
(recommended) if $T_S$ and $T_M$ are the main constituents of the application
cost model. In an implementation, this optimisation is realised as a
server-side composite operation equivalent to the given sequence of method
invocations on the remote object.
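
The effect on the loop case can be estimated with a small calculation. The following Python sketch compares n separate invocations with one composite invocation under the cost model above. It assumes that the fixed lookup and invocation time (`fixed`) is the same for the plain and the composite method; all function names and numbers are hypothetical.

```python
def remote_call_cost(k1, payload, ctx, fixed):
    """One remote invocation: K1*payload + 2*K1*|c| + fixed lookup/invocation time."""
    return k1 * payload + 2 * k1 * ctx + fixed


def loop_cost(n, k1, o, m, x, y, ctx, fixed):
    """n separate invocations r[i] = o.m(x[i])."""
    return n * remote_call_cost(k1, o + m + x + y, ctx, fixed)


def composite_cost(n, k1, o, m, x, y, ctx, fixed):
    """One composite invocation carrying all n inputs and all n outputs."""
    return remote_call_cost(k1, o + m + n * (x + y), ctx, fixed)


# Hypothetical sizes (bytes) and times, for illustration only.
args = dict(k1=1.0, o=2, m=3, x=4, y=5, ctx=1, fixed=10.0)
saved = loop_cost(3, **args) - composite_cost(3, **args)
```

Under these assumptions the saving equals n - 1 times the cost of an invocation that carries no parameters, which matches the "empty remote invocation" argument made for the two-call case.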

2. Nonblocking execution of coarse-grained computation in parallel threads.
This transformation is applied primarily when it is necessary to minimise the
reaction time of the application program. Suppose the client source code is
y=o.m(x); F; ShowY(y), where the time-consuming operation m is carried out on a
server, the following piece of code F is data independent of y, and ShowY(y) is
the closest operation that needs the computed value of y. Then it is reasonable
not to block the client program and to transform the code by 1) replacing the
y=o.m(x) statement by starting an equivalent operation in a parallel thread on
the client, in which the method m is actually invoked, and 2) inserting a
wait-for statement just before ShowY(y) to protect the variable y from too
early evaluation. A pattern for this transformation of the code in Java can
look as follows:

    RequestY()
    { Thread m_th = new Thread()
      { public void run()
        { y = o.m(x); }
      }; m_th.start();
    }

    F

    waitForY
    { if (!yReceived)
      { yWaiter.wait(); }
      ShowY(y)
    }

This example shows the simplest case of static source code transformation
based on local analysis of the data independence of program statements.
Advanced models with extended implications for parallelism extraction have been
developed by the authors in [9].
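
The same request / independent work / wait structure can be sketched with a future. The Python example below is only an analogue of the Java pattern: `slow_remote_m` is a hypothetical stand-in for the remote call o.m(x), and the log entries stand in for F and ShowY(y).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_remote_m(x):
    """Hypothetical stand-in for the time-consuming remote call o.m(x)."""
    time.sleep(0.05)
    return x * 2


def run_client(x):
    log = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future_y = pool.submit(slow_remote_m, x)  # RequestY: start the call, do not block
        log.append("F done")                      # F: work that does not depend on y
        y = future_y.result()                     # waitForY: block only when y is needed
    log.append(f"ShowY({y})")
    return log
```

Calling `run_client(21)` performs F while the call is in flight and blocks only at the point where y is consumed.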

3. Customisation of marshalling. The cost of network transfer can be decreased
by replacing GIOP marshalling with a customised one with more efficient
characteristics, by supplying an adapter library for coding and decoding custom
marshalled byte streams. Custom marshalling can be more efficient than GIOP
because we can exploit the known structure of the transferred data. Note that
it is still possible to use the CORBA network data transfer layer by
encapsulating the marshalled data stream in the CORBA type sequence<octet>.
This technique can also be applied in the case of object collocation [5], where
we can simply skip the marshalling/demarshalling stages.

We developed our own stream format, called RC-stream, for passing relatively
large data sets of known structure through a low-speed network. Adapters for
writing to and reading from an RC-stream are available to the application
programmer.

Let us denote the difference in marshalling speed between the two algorithms as
$\Delta K_M$ and the difference in the size multiplier of the marshalled data
as $\Delta K_T$.

Now we can compare the execution of two identical requests with two different
sets of marshalling constants: the difference in time is

$$(\Delta K_M + \Delta K_T \cdot K_S)(|o| + |m| + |x| + |y| + 2|c|)$$

So, in the ideal case, the parameters of the marshalling algorithm should
depend on the speed of data transfer: if we increase the marshalling time by
$\Delta K_M$, then the corresponding decrease of the request size multiplier
must be greater than $\Delta K_M / K_S$, where $K_S$ characterises the speed of
data transfer in the communication channel.

This means that custom marshalling is useful in low-speed network environments,
such as the Internet.
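
To illustrate why known structure helps, the Python sketch below packs fixed-layout records with the standard `struct` module and compares the byte count with a self-describing encoding. This is not the RC-stream format and not GIOP: the record layout (a 32-bit id plus a 64-bit value) and the use of JSON as the generic encoding are assumptions made purely for the example.

```python
import json
import struct

# Little-endian int32 id followed by a float64 value, no padding: 12 bytes per record.
RECORD = struct.Struct("<id")


def encode_custom(records):
    """Pack records back to back; the layout is known to both sides."""
    return b"".join(RECORD.pack(rec_id, value) for rec_id, value in records)


def decode_custom(buf):
    """Inverse of encode_custom."""
    return [RECORD.unpack_from(buf, off) for off in range(0, len(buf), RECORD.size)]


def encode_generic(records):
    """Self-describing encoding that repeats the field names in every record."""
    return json.dumps([{"id": r, "value": v} for r, v in records]).encode("utf-8")


records = [(i, i * 1.5) for i in range(100)]
custom_bytes = encode_custom(records)
generic_bytes = encode_generic(records)
```

The resulting byte string could be carried as an opaque octet sequence, as described above, with the adapter code on both ends agreeing on the layout.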

4. Elimination of metainformation. CORBA provides rich facilities for building
high-level general schemes of object interaction based on common design
patterns. But exploiting them usually means expensive use of metainformation,
such as passing Any type objects with type codes or using the Interface
Repository at runtime. Metainformation transfer leads to significant overhead.
So it is desirable to avoid high-level generic components in
performance-critical subsystems and instead to use specialisations of the
general schemes where all information about object types is static, all calls
to extra interfaces are known at compile time and all parameter types are
concrete.

To demonstrate the effect of our optimisation techniques, some experiments
were undertaken on a 10 Mb/s LAN, processing a sample SQL read request to a
database consisting of 10000 records. The work of the CORBA middle layer was to
pass the iteration over the requested result set to the CORBA front-end client.
Identical front-end clients and a few different implementations of the CORBA
middle-layer server were tested. The request was coded in different languages,
with and without RC coding, and with and without collocation on one machine.
The number of records retrieved during one remote method invocation was also
varied. The following combinations of these options have been tested (shown in
the diagram legend below):

1. Server (C++), Client (C++), server and clients are collocated in one address
space on a single computer, the sequence of records is marshalled with GIOP
coding.

2. Server (C++), Client (C++), server and clients are collocated in one address
space on a single computer, the sequence of records is marshalled with RC
coding.

3. Server (C++), Client (C++), server and client are not collocated (i.e.
situated in different address spaces), invocations are executed on a single
machine via the LAN interface, the sequence of records is marshalled with GIOP
coding.

4. Server (C++), Client (C++), server and client are not collocated,
invocations are executed on a single machine via the LAN interface, the
sequence of records is marshalled with RC coding.

5. Server (C++), Client (C++), calls are executed via LAN (10 Mb/s), GIOP
coding is used.

6. Server (C++), Client (C++), calls are executed via LAN (10 Mb/s), RC coding
is used.

7. Server (C++), Client (Java), calls are executed via LAN (10 Mb/s), GIOP
coding is used.

8. Server (C++), Client (Java), calls are executed via LAN (10 Mb/s), RC coding
is used.

The results of the time measurements, with a Sun Enterprise 450 under Solaris
2.6 and an Oracle database acting as server and a Pentium 300 under Windows NT
acting as client, are shown in Diagram 1.

The X-axis shows the number of records passed in one remote invocation; the
Y-axis shows the time in milliseconds for processing the requests and passing
1000 records.



4 Internet Applications:
Technique of Using Web Front-End

In this section we briefly explain the main idea of applying our methodology to
the analysis of Internet/Extranet applications. What differs between Intranet
and Internet applications is the cost of network data transfer: 10-100 Mb/s for
a LAN and 1-10 Kb/s for the Internet. Thus a good design of an Internet
application implies minimisation of the network data transfer time $T_S$, while
for an Intranet application the invocation time $T_I$ can be more critical. One
consequence for architecture design decisions is to insert an additional
software layer that collects data and passes it in large pieces in order to
speed up the overall performance of the Internet application. Such an Internet
architecture with four layers (Database, Logic, Server Front-End, Client) can
be more efficient than the traditional 3-tiered one consisting of Database,
Server and Clients. Suppose that organizing the data in a single chunk requires
processing $N$ requests with method invocations of approximately equal time
complexity. Then for the standard 3-tiered architecture we have the following
time estimate:

$$T = N K_1 (\mathit{const} + |x| + |y|) + \ldots$$

Now consider the extra layer of logic, where we have a heterogeneous medium
with transfer factors $K_1$ for the external medium (Internet) and $K_1'$ for
the internal one (LAN). If a WWW servlet executes all invocations in the LAN
environment, collecting all needed parameters with additional information and
sending them to the remote browser in a single chunk, then we obtain:

$$T^* = K_1 (\mathit{const} + N|x| + N|y| + |z|) + N K_1' (\mathit{const} + |x| + |y|) + \ldots$$

where $|z|$ is the size of the additional information added by the servlet and
the final term is the overhead due to servlet invocation. The difference will
be:

$$T - T^* = K_1 ((N - 1)\,\mathit{const} - |z|) - N K_1' (\mathit{const} + |x| + |y|) - \ldots$$

Considering that $K_1' \ll K_1$, with a difference of three orders of
magnitude, we can conclude that LAN expenses are usually negligible with
respect to the time of network transfer. So the benefit $T - T^*$ is surely
positive and can be significant if the size of the additional data is not
enormous, $|z| < (|x| + |y|) \cdot N$, and if the time of the operation is
determined mostly by the time of network data transfer.
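
The comparison can be tried out numerically. The Python sketch below covers only the transfer terms of the two estimates; the servlet overhead is passed in as a plain number, and every value used is hypothetical (the LAN factor is simply set three orders of magnitude below the Internet factor).

```python
def three_tier_time(n, k_wan, const, x, y):
    """N requests, each crossing the slow external network."""
    return n * k_wan * (const + x + y)


def four_tier_time(n, k_wan, k_lan, const, x, y, z, servlet_overhead=0.0):
    """One chunk over the external network plus N requests inside the LAN."""
    wan_part = k_wan * (const + n * x + n * y + z)
    lan_part = n * k_lan * (const + x + y)
    return wan_part + lan_part + servlet_overhead


def benefit(n, k_wan, k_lan, const, x, y, z, servlet_overhead=0.0):
    """T - T*: positive when the extra layer pays off."""
    return (three_tier_time(n, k_wan, const, x, y)
            - four_tier_time(n, k_wan, k_lan, const, x, y, z, servlet_overhead))


# Hypothetical values: 10 requests, per-request constant part of 100 bytes.
params = dict(n=10, k_wan=1.0, k_lan=0.001, const=100, x=10, y=20, z=50)
gain = benefit(**params)
```

With these numbers the gain is dominated by the (N - 1) repeated constant parts that no longer cross the slow network.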



5 Conclusion 

We have presented a time cost model for the performance of CORBA distributed
applications and proposed techniques for enhancing parallelism and optimising
the performance of distributed programs. This paper is inspired by practical
experience from a CORBA based industrial distributed software design project
undertaken at GradSoft (Kiev, Ukraine). The results of measuring the
performance of the experimental applications reported in this paper show that
the model and the techniques are a good basis for the development of software
architecture tools in support of high performance parallel and distributed
computing [10]. In particular, a new 4-tiered architecture is proposed for
Internet distributed applications as an alternative to the traditional 3-tiered
one.
Abstract. We analyze two implementation variants of a parallel computer
algebra algorithm in Distributed Maple. The original solution uses a
manager-worker mechanism to control task scheduling, which requires an
elaborate administration scheme. The new algorithm is based on a dataflow
approach where all tasks are immediately started, automatically scheduled by
the runtime system, and implicitly synchronized by task dependencies;
non-determinism is effectively applied to provide more potential for
parallelism. It turns out that the new version is not only more declarative
(closer to the mathematical problem description) but also more efficient than
the original solution.



1 Introduction 

We have developed Distributed Maple [8,9] as a portable and easy to use environ- 
ment for implementing parallel computer algebra algorithms. So far, it has been 
successfully applied to the parallelization of various basic methods and complex 
applications in computer algebra and algebraic geometry [12,11,6,13]. The sys- 
tem is based on previous experience of other authors with the parallelization of 
Maple applications [14,1,2]. 

A good deal of this effort has been to design the parallel algorithm or,
respectively, to extract parallelism from a given sequential algorithm.
However, even after the developer has devised the abstract algorithm, he is
left with a number of design decisions when constructing the concrete
implementation: how to agglomerate independent activities into concurrent
tasks, how to model the interactions between tasks, and how to schedule tasks
for execution. This gives a wide spectrum of implementation strategies: from a
program that explicitly controls task execution by a manager-worker scheme to a
program that just creates all tasks and leaves the scheduling decisions to the
runtime system.

As a matter of fact, in our previous work the decision which style to pursue
was essentially left to the taste of the developer: some algorithms were
parallelized in a high-level declarative style, while others were implemented
by low-level imperative mechanisms. The latter strategy was typically chosen
because of the opinion that keeping control over tasks yields a more efficient
solution. However, due to the lack of comparative studies, there has up to now
not been any well-founded basis for such a judgment.

In this paper, we make good this omission: for a problem that has previously
been solved in the manager-worker style of parallel computing, we develop a new
implementation that is based on the dataflow principle. The new program starts
all tasks as early as possible and leaves all scheduling and synchronization
issues to the runtime system. Our concern was to make the program as simple and
declarative as possible, i.e., to keep it close to the mathematical problem
description. For this case study, we have chosen the problem of computing the
Dixon resultant; this is part of the neighbGraph function of the computer
algebra package CASA [7] and was previously parallelized in a manager-worker
style [11].

We have performed this work in preparation for a broader activity in which we
will compare parallel algorithms implemented in Distributed Maple with
corresponding declarative versions written in the para-functional language
Glasgow Parallel Haskell (GPH) [3], which has previously been used for the
parallelization of computer algebra algorithms [4]. We are going to use GPH as
a coordination environment for scheduling computations among Maple kernels; for
this purpose, we have already developed a GPH-Maple interface [10].

2 Distributed Maple 

Distributed Maple is an environment for writing parallel programs on the basis
of the computer algebra system Maple [9]. It allows one to create tasks and to
execute them by Maple kernels running on various machines of a network. Each
node of a session comprises two components (see Figure 1):

Scheduler The Java program dist.Scheduler coordinates the node interaction.
  The scheduler process attached to the frontend kernel starts instances of the
  scheduler on other machines and communicates with them via sockets.
Maple Interface The program dist.maple running on every Maple kernel implements
  the interface between kernel and scheduler. Both components use pipes to
  exchange messages (which may embed any Maple objects).

The user interacts with Distributed Maple via the Maple frontend by a number
of programming commands, in particular:

dist[start](f, a, ...) creates a task evaluating f(a, ...) and returns a task
  reference t. Tasks may create other tasks, and arbitrary Maple objects
  (including task references) may be passed as arguments and returned as
  results.
dist[wait](t) blocks the execution of the current task until the task
  represented by t has terminated and returns its result. Multiple tasks may
  independently wait for and retrieve the result of the same task t.

This model is based on para-functional principles, which are sufficient for
many kinds of computer algebra algorithms. The environment also supports
non-deterministic task synchronization for speculative parallelism and
self-synchronized shared data objects which allow tasks to communicate by a
global store.
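
For readers unfamiliar with this start/wait style, the following Python sketch mimics the two commands with a thread pool. It is an analogy only, not Distributed Maple code: the helper names `start` and `wait`, the pool size and the sample functions are all assumptions of this example.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# A small pool is enough here; tasks that wait on other tasks occupy a worker,
# so a real system would need more care to avoid exhausting the pool.
_pool = ThreadPoolExecutor(max_workers=4)


def start(f, *args) -> Future:
    """Create a task evaluating f(*args) and return a task reference."""
    return _pool.submit(f, *args)


def wait(t: Future):
    """Block until task t has terminated and return its result."""
    return t.result()


def square(a):
    return a * a


def add_results(t1, t2):
    # Task references are ordinary values and may be passed to other tasks.
    return wait(t1) + wait(t2)


t_a = start(square, 3)
t_b = start(square, 4)
t_sum = start(add_results, t_a, t_b)
total = wait(t_sum)
```

As in the description above, the result of `t_a` can be retrieved any number of times and by several waiting tasks.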
Fig. 1. Distributed Maple architecture



3 Problem and Sequential Algorithm 

Our problem can be summarized as follows [5]: let $p$ be a bivariate polynomial
in $x$ and $z$ with rational coefficients, i.e. $p \in \mathbb{Q}[x, z]$, and
let $d_u^v(x, z)$ denote the partial derivative of $p$ of total order $u + v$.
Our goal is to compute a sequence $b_0, \ldots, b_n$ of univariate polynomials
where $b_0(x)$ is the greatest square free divisor (gsfd) of $p(x, 0)$,
$b_{i+1}(x)$ is the greatest common divisor (gcd) of $b_i(x)$ and of all
$d_u^v(x, 0)$ with $u + v = i + 1$, and $n$ is the smallest number such that
$\deg b_n = 0$.

In the sequential algorithm, the partial derivatives of total order i + 1 are 
generated from (and overwrite) the derivatives of order i as shown in Figure 2: 

    $(b, n)$ := derivatives($p$)
      $i := 0$
      $d_0 := p$
      $b_0 := \mathrm{gsfd}(d_0(x, 0))$
      while $\deg(b_i) \neq 0$ do
        $d_{i+1} := \frac{\partial d_i}{\partial z}(x, z)$
        for $j$ from 0 to $i$ do
          $d_j := \frac{\partial d_j}{\partial x}(x, z)$
        $b_{i+1} := b_i$
        for $j$ from 0 to $i + 1$ do
          $b_{i+1} := \gcd(b_{i+1}, d_j(x, 0))$
        $i := i + 1$
      $n := i$
    end

Fig. 2. Computation of derivatives of total order i + 1



4 Manager-Worker Parallelism

The main parallelization idea is to organize the computation of all $d_u^v$ in
a triangular matrix as shown in Figure 3: each line $i$ contains all $d_u^v$
with $u + v = i$, each column $j$ contains all $d_u^j$, i.e., the matrix holds
at position $(i, j)$ the derivative $d_{i-j}^{j}$. We may thus compute in
parallel all those positions $(i, j)$ with $i \ge j$ whose data dependencies
have been resolved, i.e., for which the result at $(i - 1, j)$ is available (if
$i > j$) or the result at $(i - 1, j - 1)$ is available (if $i = j$).



Fig. 3. The matrix of partial derivations (arrow label: derivation w.r.t. x)



In order to allow an efficient implementation, the algorithm must increase the
granularity of the parallel computation by letting each task compute multiple
elements of the matrix: we partition the triangular matrix into square blocks
of $m^2$ elements (for some blocking factor $m$) and let each task $(i, j)$
compute the partial derivatives of the block with upper left corner $(i, j)$.
The blocks along the diagonal boundary of the triangular matrix are themselves
triangular such that the corresponding tasks only need to compute
$\frac{1}{2}(m^2 - m)$ elements.

All tasks are created by a manager program which itself iteratively computes
the $d_0^i$, i.e., the derivatives along the diagonal boundary of the matrix.
When the manager has computed all those derivatives that represent the diagonal
boundary of a triangular block, it starts a corresponding task that computes
this block. Whenever a task has terminated, the manager starts a new task for
computing the square block that is adjacent to the lower boundary of the result
block.

Actually, the result of a task need not be the values of all $d_u^v$ in the
corresponding block because we are only interested in

1. the last line of the block which is required by the task computing the adjacent 
block (this result need not be returned to the main program but can be put 
into a shared space from which the other task can retrieve it); 

2. the greatest common divisor of each line of the block which is required to 
compute the greatest common divisor of the whole matrix line (this result 
is returned to the manager program). 

Since the gcd is commutative and associative, the program may receive in any
order the results computed by the tasks of line $i$ and combine them with the
current value of $b_i$. In a final step, $b_{i+1}$ is then combined with $b_i$.

We utilize the p processors by the following scheduling strategy: Initially, 
the manager creates tasks for the first p triangular blocks. Whenever a task 
terminates, we “enable” the task that computes the adjacent square block; if a 
terminated task has computed a triangular block (and it was not one of the p 
initial tasks), we also enable the subsequent triangular block. Among all enabled 
blocks, we choose a block with minimum line index (its computation may make 
the computation of blocks with larger indices superfluous). When the termination
criterion is detected in line i, only those tasks will be started that operate on lines 
with indices less than i; when no more task is active, the algorithm terminates. 
The algorithm can be formalized as follows: 

    $(b, n)$ := derivatives($p$)
      $T := \{\mathrm{task}(im, im) : i = 0 \ldots p - 1\}$
      $n := \deg(p)$
      while $T \neq \emptyset$ do
        wait for some task $(i, j) \in T$ and remove it from $T$
        update $b_i \ldots b_{i+m-1}$ and $n$
        enable $(i + m, j)$
        if $i = j$ then enable $(i + m, j + m)$ end
        disable all $(i, j)$ with $i > n$
        if there is some enabled $(i, j)$ with minimum $i$ then
          disable it and add task $(i, j)$ to $T$
        end
      end
      for $i$ from 0 to $n - 1$ do
        $b_{i+1} := \gcd(b_i, b_{i+1})$
      end
    end

We have implemented this algorithm in Distributed Maple [11]. Figure 4
illustrates by one example the dynamic behavior of the implementation (with
block size $m = 2$) on 20 Linux PCs with various processors connected by
switched 100 Mbit Ethernet lines: the left “machine diagram” displays on the
vertical axis each machine participating in the session; each line in the
diagram represents a task executed on a machine at a particular time. The right
“utilization diagram” lists on the vertical axis the number of machines busy at
a particular time. The sequential execution of this particular example took
552 s on a PIII@450MHz PC; the corresponding parallel computation took 27 s.

Fig. 4. Manager-worker parallelism

While the parallelization achieved significant speedup (mostly gained by the
fact that the modified order of gcd computations turns out to be more efficient
than the one in the sequential algorithm), the utilization diagrams show that
there is much room for improvement: rarely are all 20 machines busy, and on
average only about 50% of the computing resources are utilized. Apparently, the
manager is not able to issue parallel tasks at a rate that is sufficiently high
to provide idle machines with work, i.e., the explicit task scheduling becomes
a performance bottleneck. It also does not help to increase the block size:
while the manager is then no longer the bottleneck, simply too few tasks are
generated to saturate all processors and the overall execution time increases.

5 Dataflow Parallelism 

We now explore a new parallel solution where tasks are implicitly scheduled for
execution whenever their data dependencies have been resolved. Having
recognized the utilization problem, we also try to exhibit all parallelism
inherent in the problem. In a nutshell, we are now heading for a declarative
approach where the manager program only describes the collection of tasks that
have to be executed and leaves the imperative details to the runtime system.
Our goal is to have a more elegant program that also yields good performance.

We start the new design by reorganizing the decomposition of the triangular
matrix of partial derivatives as shown in Figure 5: we partition the matrix
into uniform square blocks of size $m^2$ (for some blocking factor $m$) such
that each block can be identified by the coordinate $(i, j)$ of its upper point
denoting $d_i^j$ and contains all $d_u^v$ with $i \le u < i + m$ and
$j \le v < j + m$. Once $d_i^j$ is known, all other elements of the block can
be determined by derivation with respect to $x$ or $z$, respectively. As in the
original algorithm, we compute within a block the gcd of all $d_u^v$ with the
same sum $u + v$, which then contributes to the result vector element
$b_{u+v}$. Unlike in the original solution, the lines/columns of each block run
diagonal to the computation of the derivatives, as sketched by the grey arrow
in Figure 5.

Fig. 5. Data dependencies

Since all elements in a block can be computed from the first element $d_i^j$,
we can start the computation of a block when one of the following conditions
holds:

- $i = 0$ and $d_0^{j-1}$ is known.

- $j = 0$ and $d_{i-1}^{0}$ is known.

- $i > 0$ and $j > 0$ and

  • $d_i^{j-1}$ is known or

  • $d_{i-1}^{j}$ is known.

In other words, while a block $(0, j)$ at the right boundary of the triangle
depends on its unique upper neighbor $(0, j - m)$ (because it needs
$d_0^{j-1}$) and a block $(i, 0)$ at the left boundary depends on its unique
upper neighbor $(i - m, 0)$ (because it needs $d_{i-1}^{0}$), a block $(i, j)$
in the interior of the triangle depends on one of its two upper neighbors
$(i, j - m)$ and $(i - m, j)$ and can be computed whenever one of these
neighbors has delivered $d_i^{j-1}$ or $d_{i-1}^{j}$.
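
The three cases can be written down as a small lookup. The Python sketch below returns, for a block position, the neighbor blocks it may depend on; the function name `predecessors` is hypothetical, and an inner block needs only one of the two returned neighbors.

```python
def predecessors(i, j, m):
    """Blocks whose boundary result can enable block (i, j) for blocking factor m."""
    if i == 0 and j == 0:
        return []                        # initial block: starts from p itself
    if i == 0:
        return [(0, j - m)]              # right boundary: unique upper neighbor
    if j == 0:
        return [(i - m, 0)]              # left boundary: unique upper neighbor
    return [(i, j - m), (i - m, j)]      # inner block: either neighbor suffices


deps_inner = predecessors(4, 2, 2)
```

For example, with m = 2 the inner block (4, 2) can be enabled by block (4, 0) or by block (2, 2).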

Let $n_x$ be the degree of $p$ in $x$, let $n_z$ be the degree of $p$ in $z$,
and define $n := \min(n_x, n_z)$. Since $d_{n_x}^{0}$ and $d_0^{n_z}$ are
constant, we know that $b_n$ (the gcd of $b_{n-1}$ and of all $d_u^v$ with
$u + v = n$) is constant and that therefore the computation terminates with
some $k \le n$. We therefore need to create only tasks $(i, j)$ for which
$i + j \le n$; in the course of the computation it may turn out that $k < n$
and that therefore those tasks $(i, j)$ with $i + j > k$ are not required any
more and may be stopped. We denote by

$$\mathrm{tasks}(k) := \{(i, j) : i + j = k \wedge i \bmod m = 0 \wedge j \bmod m = 0\}$$

the set of all task positions $(i, j)$ such that $i + j = k$.
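
The set definition translates directly into code. The following Python sketch enumerates tasks(k) for a given blocking factor; passing `m` as an explicit second parameter is a choice made for this example, since the text treats it as fixed.

```python
def tasks(k, m):
    """All block positions (i, j) with i + j = k and both i and j multiples of m."""
    return [(i, k - i) for i in range(0, k + 1, m) if (k - i) % m == 0]


positions = tasks(4, 2)
```

If k is not a multiple of m the set is empty, which is why the manager program below steps k in increments of m.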

The manager program starts the tasks in some order compatible with the task
dependencies and passes to each task (apart from the initial one which receives
the input polynomial $p$) the references to those tasks that the new task
potentially depends on. Then the program waits for all tasks $(i, j)$ in the
order of increasing $i + j$, combines the computed gcds with the vector $b$ and
signals by updating $n$ whether the computation can be prematurely terminated.
All tasks whose results are not required any more are then stopped:

    $(b, n)$ := derivatives($p$)
      $n := \min(n_x, n_z)$
      for $k$ from 0 to $n$ by $m$ do
        for $(i, j)$ in tasks($k$) do
          if $i = 0 \wedge j = 0$ then $t_{i,j}$ := start task($i, j, m, p$)
          else if $i = 0$ then $t_{i,j}$ := start taskR($i, j, m, t_{0,j-m}$)
          else if $j = 0$ then $t_{i,j}$ := start taskL($i, j, m, t_{i-m,0}$)
          else $t_{i,j}$ := start taskI($i, j, m, t_{i,j-m}, t_{i-m,j}$)
        end
      end
      for $k$ from 0 to $n$ by $m$ do
        tset := $\{t_{i,j} : (i, j) \in \mathrm{tasks}(k)\}$
        while tset $\neq \emptyset \wedge n = \min(n_x, n_z)$ do
          $t$ := select tset
          $(-, g)$ := wait $t$
          update $b_{i+j}, \ldots$ and $n$
          tset := tset $- \{t\}$
        end
      end
      for $k$ from $n + m$ to $\min(n_x, n_z)$ by $m$ do
        for $(i, j) \in \mathrm{tasks}(k)$ do stop $t_{i,j}$ end
      end
      for $t \in$ tset do stop $t$ end
    end

The initial task computes, starting with $p$, the first square of derivatives
and returns the “boundary derivatives” $d_{i+m-1}^{j}$ and $d_{i}^{j+m-1}$ (for
use by the neighbor tasks) and the gcd of the derivatives (for the manager
program):

    $(p_1, p_2, g)$ := task($i, j, m, p$)
      compute all $d_u^v$ with $i \le u < i + m \wedge j \le v < j + m$
      $(p_1, p_2) := (d_{i+m-1}^{j}, d_{i}^{j+m-1})$
      for $k$ from 0 to $2m$ do
        $g_k := \gcd\{d_u^v : i \le u < i + m \wedge j \le v < j + m \wedge u + v = k\}$
      end
    end

The other tasks can then be formalized with the help of the above description:

    $(p_1, p_2, g)$ := taskL($i, j, m, t$)
      $((p, -), -)$ := wait $t$
      $(p_1, p_2, g)$ := task($i, j, m, \frac{\partial p}{\partial x}$)
    end

    $(p_1, p_2, g)$ := taskR($i, j, m, t$)
      $((-, p), -)$ := wait $t$
      $(p_1, p_2, g)$ := task($i, j, m, \frac{\partial p}{\partial z}$)
    end

    $(p_1, p_2, g)$ := taskI($i, j, m, tr, tl$)
      $t$ := select $\{tr, tl\}$
      if $t = tr$
        then $(p_1, p_2, g)$ := taskR($i, j, m, t$)
        else $(p_1, p_2, g)$ := taskL($i, j, m, t$)
      end
    end

An “inner” task selects non-deterministically the result of one of the two
neighbor tasks (whichever is available first) and, depending on the selection,
proceeds like a task on the right boundary or like a task on the left boundary.
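
A comparable "whichever is available first" selection can be sketched in Python with futures. This is an illustration of the idea rather than the Distributed Maple mechanism: `neighbor_task`, the delays and the returned labels are hypothetical.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def neighbor_task(name, delay):
    """Hypothetical neighbor that delivers its boundary result after `delay` seconds."""
    time.sleep(delay)
    return name


def select(task_refs):
    """Return one task reference whose result is already available."""
    done, _pending = wait(task_refs, return_when=FIRST_COMPLETED)
    return next(iter(done))


def inner_task():
    with ThreadPoolExecutor(max_workers=2) as pool:
        tr = pool.submit(neighbor_task, "right", 0.3)   # slow neighbor
        tl = pool.submit(neighbor_task, "left", 0.01)   # fast neighbor
        t = select([tr, tl])
        branch = "taskR" if t is tr else "taskL"
        return branch, t.result()
```

Here the left neighbor finishes first, so the inner task continues along the left-boundary branch without waiting for the other neighbor's result.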

Actually, a task $(i, j)$ need not return the boundary results $p_1$ and $p_2$
to the manager program (which is only interested in the gcd vector $g$);
instead it may put the results into a shared space from which the other tasks
may retrieve them (in fact, a task can do this ahead of the rest of its
computation such that a subsequent task becomes enabled before its predecessor
has terminated). Consequently, each task need not receive references to other
tasks but only the locations of shared data where it may find the boundary
results. The main program creates these locations and passes to each task the
location of its own result and the locations of the results it depends on. To
avoid race conditions between producer and consumer, such a shared data item
needs to be self-synchronized: as long as it is empty, the consumer gets
blocked on an attempt to retrieve its content; the consumer is released only
when the producer provides the content.
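
A self-synchronized data item of this kind can be sketched in a few lines. The Python class below is a minimal illustration, assuming a single producer that writes once; the name `SyncCell` and the demo function are hypothetical and not part of Distributed Maple.

```python
import threading


class SyncCell:
    """Shared item: get() blocks until a producer has called put()."""

    def __init__(self):
        self._ready = threading.Event()
        self._value = None

    def put(self, value):
        self._value = value
        self._ready.set()        # release every blocked consumer

    def get(self):
        self._ready.wait()       # blocks while the cell is still empty
        return self._value


def demo():
    cell = SyncCell()
    received = []
    consumer = threading.Thread(target=lambda: received.append(cell.get()))
    consumer.start()             # the consumer blocks on the empty cell
    cell.put("boundary polynomial")
    consumer.join()
    return received
```

Because the producer can call `put` as soon as its boundary result exists, a consumer may proceed before the producer has finished the rest of its work.
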
The main differences of the new algorithm to the original one are as follows: 

1. Matrix Decomposition: the new algorithm uses a more regular decomposition.
This has the advantage that each task computes the same number of derivatives
$m^2$, at the price that each task returns a gcd vector of double length.
However, the new decomposition strategy is not a characterizing feature; both
strategies can be applied in both algorithms.

2. Task Synchronization: the new algorithm uses implicit synchronization 
based on task dependencies. It thus becomes much simpler than the original 
one where dependencies were explicitly maintained to schedule exactly one 
task per processor at every time. Thus the new algorithm should yield more 
parallelism and overcome the utilization problem discussed in the previous 
section. On the other hand, creating many tasks at once (many of which may
become activated and blocked) means larger memory requirements. Thus 
greater simplicity and better processor utilization at the price of larger mem- 
ory usage are the main distinguishing features of the new solution. 

3. Result Values: in the new algorithm, each task returns two derivatives and 
a gcd vector as a result; the old algorithm returned a vector of derivatives 
together with the gcd vector. This is an inefficiency in the original algorithm 
that was not previously recognized because of a focus on the behavior of the 
sequential algorithm where all but one derivatives in a line are constructed 
from the previous line by derivation with respect to variable x. It would have
also sufficed to communicate a single derivative (the left lower corner of the
triangle/matrix) from one task to its successor, since the first line of each
square block can be computed by derivation with respect to z.

Fig. 6. Data dependencies and sample dependency paths

4. Result Timing: A task may put the computed boundary polynomials into
a shared data space and thus enable the computation of two neighboring
tasks before it performs the remainder of its computation, which is
not required to enable other tasks. In this way, tasks may become active
earlier and the utilization increases faster than in the original algorithm. As
an illustration, the non-shaded areas in the left diagram in Figure 6 denote 
those elements of the derivation matrix which are not required to enable any 
task to compute its results. Again, a similar optimization might have been 
performed in the original algorithm. 

5. Non-determinism: the new algorithm does not specify which of the two 
possible predecessor tasks returns the boundary polynomial that enables the 
further execution of an inner task. An inner task may thus be enabled by 
different paths of task dependencies as illustrated by the right diagram in 
Figure 6 that shows two possible paths that enable a particular task. 

6. Speculation: the new algorithm exhibits more speculation than the original 
one, i.e., it starts more tasks whose result may turn out to be not required
any more. In the original algorithm, on each of the p machines at most
one task was active at a time, such that at most p - 1 tasks may become
superfluous. In the new algorithm, almost all tasks are initially created before
starting to check for termination. Thus, depending on the actual input, many 
of the created tasks may become superfluous. 
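
The difference between waiting for whichever predecessor finishes first (item 5) and waiting for a fixed predecessor can be sketched in Python. This is an illustration only, not Distributed Maple code: futures stand in for task results, and the function names first_available and first_listed are hypothetical.

```python
from concurrent.futures import Future, wait, FIRST_COMPLETED

def first_available(candidates):
    """Non-deterministic selection: return a future whose result is ready."""
    done, _pending = wait(candidates, return_when=FIRST_COMPLETED)
    return next(iter(done))

def first_listed(candidates):
    """Deterministic selection: always wait for the first candidate."""
    candidates[0].result()  # blocks until this particular predecessor is done
    return candidates[0]

left, upper = Future(), Future()  # results of the two possible predecessors
upper.set_result("boundary from upper neighbour")

# Only the upper predecessor has delivered; the inner task can start anyway.
chosen = first_available([left, upper])
print(chosen.result())
```

In this situation first_listed([left, upper]) would block until the left predecessor delivers, which corresponds to over-specifying the task dependencies.
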

To what extent these expected differences between the original and the new
algorithm are actually relevant is investigated experimentally in the following
section.



6 Experimental Results 

We have implemented in Distributed Maple the algorithm described in the pre- 
vious section (including the optimization of using shared data spaces). The new 
implementation required about 140 lines of Maple code (70 lines for the main
program) while the original one required 200 lines (140 for the main program);
i.e., the size of the source code was reduced by 30% (50% for the main program).

Behavior. We have benchmarked the program in our local network with 20
Linux PCs with various Pentium class processors connected by switched 100 Mbit
Ethernet lines; such a “Beowulf cluster” is currently the most suitable one for
Distributed Maple applications. We have executed the sample benchmark problems
as described in Section 4 and illustrate in Figure 7 two executions with
block sizes m = 4 and m = 5 (which generally gave the best results).





Fig. 7. Dataflow parallelism (block sizes 4 and 5) 



All in all, the algorithm proceeds in the following phases: 

1. all tasks are started and get immediately blocked (short activities on the left 
boundary of the machine diagram followed by empty space), 

2. the first task is executed and lets other tasks resume execution which in turn 
let other tasks resume while the early tasks terminate again (task gap in the 
upper left part of the diagram); 

3. more and more tasks get executed and utilization increases to maximum 
(task gap is filled), 

4. many tasks are executed yielding maximum utilization (large block of tasks), 

5. a few leftover tasks remain to be executed and the utilization drops to one 
(empty space on right boundary of diagram). 

Figure 7 shows that the frontend node (top line) has several tasks executed at
the very end of the computation when no other nodes are active any more; this
comes from multiple tasks that were scheduled early to the frontend and got
blocked on data dependencies, such that these tasks had to be resumed and
completed later.

Comparing these diagrams with those for the old algorithm in Figure 4 illustrates
dramatic differences: in the new algorithm with m = 4, after the startup
phase which takes 20% of the execution time, all 20 machines get saturated and
compute tasks for 65% of the whole computation time. After that, the utilization
curve drops rather sharply such that the final phase takes only 15% of the
computation. The overall utilization rate is about 75%. In the m = 5 case, the
startup phase reaches a higher utilization rate earlier; however, load imbalances
in the later phase (caused by slower machines) let the utilization curve drop
earlier.
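
As a rough consistency check on the figures quoted for m = 4, one may assume (this is an assumption made here, not a measurement from the experiments) that utilization is 100% during the saturated phase. The overall rate of about 75% then implies the average utilization of the startup and final phases:

```python
# Phase shares of the total execution time, as quoted in the text (m = 4).
startup, saturated, final = 0.20, 0.65, 0.15
overall = 0.75  # overall utilization rate quoted in the text

# Assumption: all machines are fully busy during the saturated phase.
busy_in_saturated = saturated * 1.0

# The remaining busy time is spread over the startup and final phases.
avg_other = (overall - busy_in_saturated) / (startup + final)
print(f"average utilization outside the saturated phase: {avg_other:.0%}")
```

Under this assumption the machines are busy a little under a third of the time outside the saturated phase.
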



Fig. 8. Execution times and speedups (panels for Examples 1, 2, and 3: time in seconds and speedup, each plotted against the number of processors)



Execution Time. The actual execution times of the new algorithm in compar- 
ison with the execution of the old algorithm are listed in Figure 8: the top row of 
diagrams shows the execution time of the programs, the bottom row shows the 
speedup that the new algorithm gains over the original one. The comparison is 
based on the algorithm variants labelled as follows: 

— O: the original algorithm. 

— N4, N5: the new algorithm with non-deterministic task selection in the main 
program and in task_{i,j} with block sizes m = 4 and m = 5.

— D4, D5: the new algorithm with deterministic selection in the main program 
(t := first(tset)) and in task_{i,j} (t := t1) with block sizes m = 4 and m = 5.









All variants of the new algorithm are considerably faster than the original
version, with an average speedup (compared to the original version) of 1.8
(Example 1) and 2.2 (Examples 2 and 3), respectively. We could thus reduce the
sequential execution time from 552s, 198s, and 1798s, respectively, on a
PIII@450MHz PC down to a parallel execution time of 14s, 6s, and 29s,
respectively. The fact that the speedups are drastically superlinear is caused
by the changed order in which the greatest common divisors are computed (also
in the original parallel algorithm); this is a hint for a general algorithmic
improvement of the sequential algorithm.
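
The superlinearity can be checked directly from the quoted times. The short Python calculation below uses only the numbers given in the text; that the parallel times were obtained on all 20 machines is an assumption made here.

```python
# Times in seconds quoted in the text: sequential run on a PIII@450MHz PC
# versus the parallel run of the new algorithm (Examples 1 to 3).
sequential = {"Example 1": 552, "Example 2": 198, "Example 3": 1798}
parallel = {"Example 1": 14, "Example 2": 6, "Example 3": 29}
machines = 20  # assumption: the parallel times were measured on all 20 PCs

speedups = {name: sequential[name] / parallel[name] for name in sequential}
for name, s in speedups.items():
    print(f"{name}: speedup {s:.1f}, superlinear: {s > machines}")
```

All three ratios exceed the number of machines, which is only possible if the parallel version performs less total work than the sequential one.
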

Block Size. Comparing the algorithm variants N4 and N5, we see that for
smaller processor numbers (p < 12) N5 is better while N4 seems to have some
advantage for larger processor numbers. This result is consistent with our
previous observation that larger block sizes may cause some load imbalance in
the final phases of the computation (which is more significant for larger
processor numbers). A similar trend can be seen when comparing D4 and D5.

Non-determinism. For analyzing the effect of non-determinism on the changes
of task dependencies, we have counted for a sample run with 20 machines the
number of times that the non-deterministic task synchronization operation was
called and how often it was not the first task that was selected because its
result was not available (i.e., how often non-deterministic task selection
actually made a difference compared to the deterministic variant): in the main
program, it was not the first task that was selected in 44% of all
non-deterministic calls; in task_{i,j}, this was the case in 19% of all calls.
Thus there is a significant difference in synchronization dependencies, albeit
two times more in the main program than in the individual inner tasks.

For analyzing the effect of non-determinism on the actual execution times, we
compare the algorithm variants N5 with D5 and N4 with D4. We see that N5 is in
many cases about 10% faster than D5, i.e., the use of non-determinism brings
a significant (but not dramatic) improvement in performance. In the comparison
N4 versus D4, this effect is not that significant but still visible, especially
for smaller numbers of processors.

Figure 9 illustrates the differences in the dynamic behaviors of D5 and N5
by the trace of an execution with 20 processors. The difference between
non-deterministic and deterministic execution is visually captured by additional
“gaps” in the traces of individual machines for the deterministic variant; the
non-deterministic variant keeps the processors continuously busy after an
initial phase. The large block size becomes especially significant in the later
phases of the algorithm where a few late tasks may hamper the completion of the
algorithm. The smaller the block sizes are, the better the overall utilization
becomes also in the final phases of the computation.

Speculation. In all three examples, the amount of speculation (tasks whose
result was not required) was not significant. The only tasks (i,j) that ever
became superfluous were those with maximum i + j, i.e., those at the bottom of
the triangular matrix (because other tasks in that row already yielded a gcd
with constant degree).

Fig. 9. Deterministic vs non-deterministic selection



7 Conclusions 

We have rewritten a Distributed Maple application from a low-level imperative 
to a more high-level declarative style and compared the results. This has demon- 
strated that the dataflow style of computing with implicit task synchronizations 
can considerably improve the performance: we have achieved a speedup of 2 com- 
pared to the original solution where the program explicitly schedules tasks for 
execution. While the original version is not able to saturate all processors, the 
dataflow solution yields good utilization even for a larger processor number. We 
have also shown that a declarative solution that uses non-determinism in order 
not to over-specify task dependencies may yield a performance improvement of 
10% compared to a more imperative program that constrains data dependencies 
further than algorithmically required. 

Both results demonstrate that also in Distributed Maple a more declarative
solution that does not care about scheduling decisions may be more efficient
than one where the programmer tries to keep explicit control over all aspects of
the computation. Thus parallel declarative programs need not be a priori less
efficient than parallel imperative ones. The presented work has been performed
in preparation of a study that will compare parallel algorithms implemented
in Distributed Maple with corresponding declarative versions that will (on the
basis of a recently developed Haskell-Maple interface) use the para-functional
language Glasgow Parallel Haskell (GPH).

Abstract. The paper presents a communication interface Coin. The interface
is developed in the Research Institute for Systems Studies of the Russian
Academy of Sciences. The interface is intended for building high performance
distributed computer systems, massively parallel processor computers, and
clusters.



1 Introduction 

High performance parallel computer systems require more powerful communication
subsystems than the conventional IPC tools of ordinary operating systems
can provide. Specialized communication interfaces are being developed that
best suit the classes of tasks and the quality of inter-process communication
inherent in high performance parallel computer systems.

The communication interface Coin is one such communication subsystem. The
interface has been developed taking into account the experience of a
number of firms dealing with such systems. Coin is based on such open standards
and specifications as Virtual Interface (VI), InfiniBand, and Myrinet.

Virtual Interface is an open specification developed by Intel, Compaq and
Microsoft [1]. The VI represents an architecture for the interface between high
performance network hardware and computer systems. The goal of this architecture
is to improve the performance of distributed applications by reducing
the latency associated with critical message passing operations. This goal is
attained by substantially reducing the system software processing required to
exchange messages compared to traditional network interface architectures.

The InfiniBand Architecture Specification is developed by a group of vendors,
with Intel, IBM, Compaq, Dell, Hewlett-Packard, Microsoft and Sun Microsystems
among them [2]. InfiniBand describes a first order interconnect technology
for interconnecting processor nodes and I/O nodes to form a system area network
[3].

Myrinet represents the architecture of a high-performance inter-computer 
packet switching network [4]. Specification of Myrinet was developed by Myricom 
Inc. The standard includes, either directly or by reference, the specification of 
the Data Link level, timing information, character set, signals, and the details 
of the connectors.

2 Architectural Scope of Coin

Traditional network architectures do not provide the performance required by
modern distributed applications, largely due to the host-processing overhead of
kernel-based transport stacks. These problems are addressed in the Coin
architecture by moving the network much closer to the application, increasing its
functionality, and better matching its features to application requirements. The
Coin architecture looks like a four-layer communication stack (see Fig. 1).

Fig. 1. Four-layer communication stack.



The API layer includes applications running in parallel on the nodes of the
system. The Transport layer is represented by ports. The ports are global
objects through which applications interface with each other. The Data link and
Physical layers can be of two types. If the interfacing ports are located
within the same operating system, these ports communicate through a
conventional IPC mechanism, shared memory. If the interfacing ports are located
on different nodes of the computer, the ports communicate through the
Myrinet-like communication media.

In the traditional network architecture, the operating system virtualizes the 
network hardware into a set of logical communication endpoints available to 
network consumers. The OS multiplexes access to the hardware among these 
endpoints. In most cases, the operating system also implements protocols that 
make communication between endpoints reliable. This model permits the inter- 
face between the network hardware and the operating system to be very simple. 
The drawback of this organization is that all communication operations require 
a call or trap into the operating system kernel, which can be quite expensive 
to execute. The demultiplexing process and reliability protocols also tend to be 
computationally expensive. 

The Coin architecture eliminates the system-processing overhead of the
traditional model by providing each consumer process with a protected, directly
accessible interface to the network hardware - a Port. Each port represents
a communication endpoint (see Fig. 2). A process may own one or multiple
ports exported by one or more network adapters. A network adapter performs
the endpoint virtualization directly and subsumes the tasks of multiplexing,
demultiplexing, and data transfer scheduling normally performed by an OS kernel
and device driver. An adapter may completely ensure the reliability of
communication between interfacing ports. Alternatively, this task may be shared
with transport protocol software loaded into the application process. The
adapter has direct access to the virtual memory space of the processes using
Coin. The adapter can perform transfers directly from the virtual memory of one
process into the virtual memory of another process, eliminating the need for
calls or traps into the operating system kernel.

Fig. 2. Port-based communication model.

Coin is a message-passing interface. All data transfers are done through
message passing between ports. The interface supports two kinds of communication
operations:

1. Send/Receive 

2. Remote Direct Memory Access 

In the first case, the communicating processes should be aware of each other,
and each control structure for a Send operation on the sending side should be
accompanied by a corresponding Receive descriptor on its peer at the receiving
side. The RDMA operations can be performed purely remotely without the need to
notify the peer process. These operations are faster than operations of the
Send/Receive model. Alternatively, the RDMA operations can be performed
with notification of the peer process.
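
The contrast between the two kinds of operations can be made concrete with a toy Python model. It is a sketch only and does not reproduce the Coin API: the class Port and the functions send and rdma_write are hypothetical names, and a bytearray stands in for registered process memory.

```python
from collections import deque

class Port:
    """Toy model of a port (hypothetical; not the Coin API)."""

    def __init__(self, size=16):
        self.memory = bytearray(size)  # stands in for registered memory
        self.receive_queue = deque()   # posted Receive descriptors
        self.delivered = []            # messages handed over to the process

    def post_receive(self, tag):
        # The receiving process announces that it expects a message.
        self.receive_queue.append(tag)

def send(dst, payload):
    # Send/Receive: every Send must meet a Receive descriptor at the peer.
    if not dst.receive_queue:
        raise RuntimeError("no matching Receive descriptor posted")
    tag = dst.receive_queue.popleft()
    dst.delivered.append((tag, payload))

def rdma_write(dst, offset, payload):
    # RDMA: data lands in the peer's memory without involving the peer process.
    dst.memory[offset:offset + len(payload)] = payload

peer = Port()
rdma_write(peer, 0, b"abc")  # works although the peer posted nothing
peer.post_receive("r1")
send(peer, b"hello")         # consumes the posted descriptor
```

A second send without a further post_receive raises an error in this model, which mirrors the requirement that each Send be matched by a Receive descriptor.
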

Coin has three levels of reliability. Different ports within one process can
have different levels of reliability. Communication is possible only between
ports with the same reliability level.

Ports can communicate either in datagram mode or in a mode with a logical
connection. In addition, any port can be assigned a logical partition number.
Only ports from the same logical partition can communicate.
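
The two compatibility rules stated above (equal reliability level, equal logical partition) amount to a simple predicate. The Python sketch below is illustrative; the type PortConfig, its fields, and the integer encoding of the three reliability levels are assumptions, not part of the Coin specification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PortConfig:
    reliability: int  # one of the three reliability levels, encoded here as 0, 1, 2
    partition: int    # logical partition number

def can_communicate(a: PortConfig, b: PortConfig) -> bool:
    """Both rules from the text must hold for two ports to exchange data."""
    return a.reliability == b.reliability and a.partition == b.partition

p1 = PortConfig(reliability=2, partition=7)
p2 = PortConfig(reliability=2, partition=7)
p3 = PortConfig(reliability=0, partition=7)
print(can_communicate(p1, p2), can_communicate(p1, p3))
```
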

3 Adapter — Intellectual Communication Controller 

An adapter performs a substantial part of the communication functions. A
generalized model of the adapter is shown in Figure 3. The model corresponds to
the model of the adapter in the Virtual Interface architecture [1].




Fig. 3. Generalized model of adapter.



A host processor has access to the adapter through a memory region of the
I/O memory space. The processor can access only the register set of the ICC, the
adapter memory, and the doorbell regions.

The adapter has direct access to the virtual memory space of the processes.
The adapter can handle the following objects in memory:

— operation descriptor queues 

— completion queues 

— buffers of shared memories 

— global interrupt queue 

These objects can be directly accessed by the adapter, and thus they must be
registered in the operating system. Being registered in the operating system
means that the page of memory is protected from swapping and reordering, and
information on the virtual-to-physical address translation for this page must be
placed into the adapter.

The adapter can generate hardware interrupts to the host processor. These
interrupts are trapped and handled by the corresponding kernel agent of the
operating system.
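
The registration requirement for adapter-accessible objects (a pinned page plus a translation entry held by the adapter) can be pictured with a small Python model. This is a sketch under assumptions: the class AdapterTranslationTable, the page size of 4096 bytes, and the method names are hypothetical and do not describe the actual ICC.

```python
PAGE_SIZE = 4096  # assumed page size

class AdapterTranslationTable:
    """Toy model: virtual page number -> physical frame number."""

    def __init__(self):
        self._frames = {}

    def register(self, virtual_page, physical_frame):
        # In a real system the OS would also pin the page at this point.
        self._frames[virtual_page] = physical_frame

    def translate(self, virtual_address):
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page not in self._frames:
            raise KeyError(f"page {page} is not registered with the adapter")
        return self._frames[page] * PAGE_SIZE + offset

table = AdapterTranslationTable()
table.register(virtual_page=5, physical_frame=42)
print(hex(table.translate(5 * PAGE_SIZE + 0x10)))
```

An access to an unregistered page fails in this model, which is why descriptor queues, completion queues, and buffers have to be registered before the adapter touches them.
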

Functionally, the adapter includes the following units:

— Host System Interface 

— Common Control Unit 

— Ports Contexts Storing and Processing Unit 

— Packet Transmit/Receive Unit

All these units, cooperating with each other, can effectively perform
communication operations of different kinds in a partially autonomous mode. In
this mode the host CPU is relieved of performing a substantial part of the
communication interface functions. However, there is also a mode in which the
host CPU controls all the functions of the adapter. In this mode the CPU deals
with the registers of the adapter and manages all data flows through the adapter.




Fig. 4. Architecture of Intellectual Communication Controller 



Architecturally, the adapter includes an Intellectual Communication Controller
(ICC) and local memory arrays. All communication functions are implemented
by the ICC. Memory arrays are used for temporary packet storage and for storing
control and service information. In addition, the memory is used by the
RISC-core of the ICC for program and data storage.

ICC includes the following units: 

— System Bus Interface 

— Communication Media Channel Interface 

— Local Memory Controllers 

— RISC-core 

— Data-Flow Machine 

Communication functions can be implemented exclusively with the RISC- 
core of ICC at micro-code level. In addition, the most time critical functions 
can be implemented in hardware. The Data-Flow Machine represents a set of 
hardware implemented communication functions. 

4 Conclusion 

A communication interface Coin was presented. The interface can be used for
developing parallel computer systems, massively parallel processor systems, and
clusters. The interface represents an architecture different from the traditional
network architectures. The architecture better satisfies the demands of high
performance parallel computer systems executing tightly coupled distributed
tasks.

Abstract. We propose a design of a simple tool that can be used by
a distributed application to discover the relevant network information 
dynamically. The simplicity is a key design feature: the tool can be used 
without multiple modifications of the application code. The timely notifi- 
cation of the application is performed using a callback mechanism which 
minimizes the application idle time. The network information is gathered 
and analyzed simultaneously with application execution. We show that 
empowering an application with a knowledge of network characteristics 
provides insights into possible application adaptation mechanisms and 
into the causes of communication delays. 



Keywords: network information collection, callback application notification,
application adaptations.



1 Introduction 

The need for distributed resources has been widely accepted in the scientific
community for performing computationally-intensive tasks. Applications are made
portable across various interconnecting technologies. However, for a distributed
application it is difficult to attain good performance on different
interconnections since they are often shared among several communicating
programs. The performance of the interconnections varies with static
(configuration) characteristics and dynamic network conditions that change
depending on the network load and communication distance. At present, the
majority of network protocols permit no reservation of network resources for
the application use. A growing number of distributed applications are
computationally-intensive scientific applications which have no means to learn
the network information and to request a particular amount of network resources.
For such and many other applications, it is desirable to have a mechanism that
provides the network information transparently to the application programmer or
user, so that the burden of handling the low level network information is
shifted to a network developer. Many research projects have focused on this
pressing and complex task (see, e.g., [2], [7], [3], and [11]) targeting
different network architectures and configurations. With a knowledge of network
performance, the application may adapt itself to perform the communication more
efficiently. The adaptation features are, of course, application-specific. For
a scientific application, it may be beneficial to perform more local
computations (iterations) while waiting for the data to arrive [10]. For a file
transfer, an adaptation may consist of choosing a server with a better
connection [4].

Our first goal is to supply an application with the network information only 
if this information becomes critical, i.e., when the values for the network char- 
acteristics to be observed fall outside of some feasible bounds. The feasibility is 
determined by an application and may be conveyed to the network information 
collector as a parameter. Note that these bounds might not be attained under the
particular network conditions since the end-to-end reservations are difficult to 
enforce in the “best effort” network protocols with different administrative sys- 
tems owning parts of the network. This selective notification approach is rather 
advantageous both when there is little change in the dynamic network charac- 
teristics and when the performance is very changeable. In the former case, there 
is no overhead associated with processing unnecessary information. In the latter, 
the knowledge of the network may be more accurate since it is obtained more 
frequently. 

The second goal is to augment the application execution with this knowledge 
of the network while requiring minimum modifications of the application and 
without involving the user/programmer in the network development effort.
We accomplish this goal by using callback mechanisms, implemented similarly 
to the description in [6]. 

This paper is organized as follows. Section 2 describes the design of our 
network information collection and justifies the design choices made. Section 3 
presents a few experiments with a user application. The concluding remarks 
appear in Section 4. 



2 Design of Network Information Collection and
Application Notification (NICAN)

We consider a host-based design in which the information about the network is
collected in the endpoints. The primary justification for this approach is that it
does not require any access to the routers from the user and assumes no particular
software configuration on a router. Thus application programmers can easily
utilize our tool without a network manager’s help. The first aspect of the design
is that each host, which participates in the distributed computation, may have
its own NICAN that alerts the host when certain events happen in the network
connection. When initializing NICAN on a computing node, an application may
request NICAN to monitor one or several network characteristics, such as the
effective throughput on an external (network) interface of the node or the latency
between this node and a neighbor node participating in the distributed
computation. Such characteristics may be passed as parameters to NICAN, and thus
form a “multirequest”, i.e., a single request that contains a number of network
events which NICAN is capable of monitoring.

The signaling mechanism of NICAN delivers network information in a timely
fashion such that there is no instrumenting of an application with, say,
call-queries directed to the network interface. In fact, the initialization of
the NICAN tool may be the only non-application specific modification required in
the application code to interface with NICAN. The application is alerted only if
the network characteristics monitored fall outside certain bounds, which could
be either inserted in the multirequest by the application at NICAN
initialization or taken as defaults by NICAN. Such types of bounds as maximum and
minimum values of the effective network bandwidth (throughput) and latency
seem the most common among distributed applications and are often sufficient
for an application to make decisions on proper adaptations to the network
conditions. Thus our design enables application notification with the
information based on the boundary values of these types. Similar to the case
described in [6], our design needs a new signaling mechanism that passes to the
application the information processed by NICAN. (This is not always possible
with the standard signaling techniques of an operating system such as Unix.) If
NICAN sends a signal to an application, then the application may need to engage
its adaptive mechanisms. To minimize changes inside the application code, we
propose to encapsulate application adaptation in a notification handler (signal
handler) invoked upon the signal receipt. This signal handler can contain
adaptation code with possible access to some application variables. One way to
implement this access is to use the shared memory paradigm as provided by Unix.

Once initialized by an application, local NICAN runs independently from the 
application. Therefore it may probe the network as often as deemed necessary 
without causing an application to wait for the result as it happens in the query- 
based mechanisms. The callback approach decouples the network analysis from 
the application execution which may lead to more precise results of the analysis. 
Multiple probes of the network are recorded to estimate the network performance 
over a longer period of time. They may also be useful for the prediction of network
performance in such common cases as when an iterative process lies at the core
of the application.
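
The selective, callback-based notification described in this section can be sketched in a few lines of Python. This is not the NICAN implementation: the class Monitor, its parameters, and the sample values are hypothetical, and the polling is done synchronously here, whereas NICAN runs independently of the application and uses a signaling mechanism.

```python
class Monitor:
    """Calls `handler` only when a probed value leaves [low, high]."""

    def __init__(self, probe, low, high, handler):
        self.probe = probe      # function returning the current measurement
        self.low, self.high = low, high
        self.handler = handler  # application-supplied notification handler
        self.history = []       # recorded probes, e.g. for later prediction

    def poll_once(self):
        value = self.probe()
        self.history.append(value)
        if not (self.low <= value <= self.high):
            self.handler(value)  # selective notification

samples = iter([9.0, 8.7, 3.2, 9.1])  # made-up throughput values in Mbps
alerts = []

monitor = Monitor(probe=lambda: next(samples), low=5.0, high=10.0,
                  handler=alerts.append)
for _ in range(4):
    monitor.poll_once()
print(alerts)
```

The application code only supplies the bounds and the handler; all probing and bookkeeping stays inside the monitor, which is the property the design aims for.
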

In NICAN, a process of collecting the network information is separated from 
its other functions, such as notification, and is encapsulated into a module that 
can be chosen depending on the types of the network, network software con- 
figuration, and the information to be collected. For example, assume that the 
current throughput is requested by an application during its execution. Then, if 
the network has the Simple Network Management Protocol (SNMP) [5] installed, 
NICAN will choose to utilize the SNMP information for throughput calculation. 
Otherwise, some benchmarking procedure - more general than probing SNMP 
but also more costly - could be applied to determine the throughput. To
determine the latency between two hosts, system utilities such as ping and
traceroute can be used. NICAN collects latency independently of throughput.
Thus, the collection of these two network characteristics is performed
simultaneously if the information on both latency and throughput is requested.
The modular design enables an easy augmentation of the collection process with
new options, which ensures its applicability to a variety of network
interconnections.

Figure 1 summarizes our design by depicting a general host-based view of the
NICAN architecture, which consists of the interface between NICAN and the
application, the notification mechanism, and the data collection and analysis
components. Solid arrows represent the initialization of NICAN from an
application and the launch of the collection module from NICAN. The flows of the
application multirequest, NICAN signals, and shared memory accesses are shown as
dashed arrows. Detailed information on the design and implementation can be
found in [1].




Fig. 1. NICAN architecture



3 Experiments 

We have built a prototype of NICAN that uses several different procedures- 
modules for network information collection. This makes the tool useful for a 
variety of network types and available network management software. Here we 
present a set of experiments in which NICAN calculates the bandwidth available 
to a host by using the information provided by the Simple Network Management 
Protocol. The SNMP agent can be installed on a host, thus requiring no router 
access by the user. SNMP collects a set of data in the internal database (MIB). 
Our prototype of NICAN polls the database periodically. (The time period of
polling is determined automatically by NICAN.) Then NICAN sends a signal
to an application according to the specified criteria, which are passed as
parameters to NICAN. In this experiment, we have used the following criterion:
Report “peak” throughput if it is above 8.5 Mbps AND if the change in throughput
is less than 10% AND report throughput only once. We call this criterion the
first peak throughput criterion. The number 8.5 Mbps has been taken based on the
characteristics of the network which we used for the experiments. It is a Local 
Area Network accessed using Ethernet protocol with nominal bandwidth of 10 
Mbps. In Figure 2, the interconnection of the hosts we used for testing is pre- 
sented. NetPIPE [8] is considered as an example of user application. NetPIPE 
is a network benchmarking program that utilizes the network heavily. Specifi- 
cally, this application sends over a TCP connection a message of increasing size 
and measures the time of its delivery, thus calculating the effective throughput, 
which includes also TCP and network overheads. For a distributed application 
that uses Message Passing Interface (MPI) [9] (on top of TCP), the commu- 
nication overhead also includes the overhead for MPI. Since most of the high 
performance computing applications use MPI to ensure portability across dis- 
tributed environments, measuring and monitoring the MPI overheard may be 
useful for performance tuning. Thus NICAN provides a way to interact with 
MPI-based distributed applications. 
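
The first peak throughput criterion can be illustrated with a minimal sketch. It is not NICAN code: the function name, the callback signature, and the sample values are hypothetical, and the change in throughput is assumed to be measured relative to the previous polled sample.

```python
from typing import Callable, Iterable, Optional


def first_peak_notifier(
    samples_mbps: Iterable[float],
    notify: Callable[[float], None],
    threshold_mbps: float = 8.5,
    max_relative_change: float = 0.10,
) -> Optional[int]:
    """Call notify once, at the first sample that satisfies the criterion.

    Returns the index of the reported sample, or None if no sample qualifies.
    """
    previous = None
    for index, current in enumerate(samples_mbps):
        if previous is not None and previous > 0:
            # Assumption: change is relative to the previous sample.
            change = abs(current - previous) / previous
            if current > threshold_mbps and change < max_relative_change:
                notify(current)  # report only once, then stop polling
                return index
        previous = current
    return None


# Hypothetical throughput samples (Mbps), one per polling interval.
reports = []
samples = [1.0, 4.0, 7.5, 8.8, 9.1, 9.2, 9.2]
reported_at = first_peak_notifier(samples, reports.append)
```

Here the callback is simply `reports.append`; an application would pass its own handler, in the spirit of the callback mechanism described above.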




Fig. 2. Interconnection of the hosts used in the experiments



The bandwidth measurements taken by NetPIPE and NICAN are shown in 
Figure 3. Note that the throughput calculated by NICAN is almost always an 
upper bound on the bandwidth calculated by NetPIPE. This can be explained 
by the presence of the transport layer protocol TCP overhead in the measure- 
ments by NetPIPE. At the same time, the bandwidth calculated by NICAN is 
the actual number of all incoming and outgoing packets on the external network 
interfaces divided by the polling interval. The difference between the measure- 
ments is especially pronounced in the beginning of execution, for small messages, 
and when the bandwidth limitations are reached for large messages which need
segmentation. Note that the bandwidth measured by NetPIPE is roughly the
same regardless of the LAN topology, whereas the effect of an extra hop (via
host B in Figure 2) is noticeable compared with the higher bandwidth values
recorded by NICAN along the direct H2-H3 link.
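
As a rough illustration of dividing interface traffic by the polling interval, the sketch below derives a throughput figure from two successive counter readings. It is an assumption-laden example rather than the NICAN implementation: it counts octets instead of packets so that the result can be expressed in Mbps, it assumes 32-bit wrapping counters (as for the standard interface counters ifInOctets and ifOutOctets), and all function names are hypothetical.

```python
COUNTER32_MODULUS = 2 ** 32  # assumed: 32-bit counters that wrap around


def counter_delta(previous: int, current: int) -> int:
    """Difference between two readings, allowing for one wrap-around."""
    return (current - previous) % COUNTER32_MODULUS


def throughput_mbps(prev_in, prev_out, cur_in, cur_out, interval_s):
    """Incoming plus outgoing octets over one polling interval, in Mbit/s."""
    if interval_s <= 0:
        raise ValueError("polling interval must be positive")
    octets = counter_delta(prev_in, cur_in) + counter_delta(prev_out, cur_out)
    return octets * 8 / interval_s / 1e6


# Hypothetical readings one second apart; the outgoing counter wraps.
rate = throughput_mbps(
    prev_in=1_000_000,
    prev_out=2 ** 32 - 100_000,
    cur_in=1_500_000,
    cur_out=525_000,
    interval_s=1.0,
)
```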



[Figure: two plots, each titled "Throughput vs Elapsed Time"]

Fig. 3. Simultaneous NetPIPE and NICAN bandwidth measurements: between hosts
H1 and H2 (left), between hosts H2 and H3 (right)



In our experiments, having the bandwidth delivered by NICAN as a (close) 
upper bound on the bandwidth of the application ensures that the application 
does not exhaust the network capacity and can make timely adaptations. We 
have supplied NetPIPE with a notification handler to react to the first peak band- 
width signal delivered by NICAN. In particular, the handler stops the growth 
of the transmitted messages so that the maximum bandwidth perceived by Net- 
PIPE is sustained without excessive consumption of computational and network 
resources. Figure 4 zooms in on the occurrence of the first peak bandwidth notification.
The bandwidth values measured by NetPIPE remain very close with
and without invoking the adaptation mechanism (solid and dashed lines, respectively,
in Figure 4, left), which indicates that the peak bandwidth is reached
for a particular message size. NICAN (dashed-dotted and dotted lines) detects
the peak bandwidth and notifies NetPIPE at about tp = seconds. On the
other hand, for NetPIPE, the times to transmit a message (Figure 4, right) differ
greatly beyond tp.

For the MPI version of NetPIPE (called NetPIPE-MPI), the adaptation is
even more important. As seen in Figure 5 (left), adaptive NetPIPE-MPI (solid
line) predicts the maximum throughput of the link more accurately. The poor
prediction of the non-adaptive application (dashed line) can be attributed to the
effects of the MPI buffer configuration, which is system- and implementation-specific.
In particular, during the long periods of buffer handling, the link load
decreases while the time recording continues. If the MPI buffering delays are
overlapped with the communication of some other data, then the throughput
(dash-dotted line in Figure 5, right) recorded by NICAN increases, reaching nearly the
link capacity. Note that the presence of another communicating program (in this
case, the TCP version of NetPIPE, the starting point of which is indicated by
the circle in Figure 5, right) affects the calculations done by NetPIPE-MPI
since NetPIPE-MPI is unaware of the competing communication program. On the
other hand, the NICAN output suggests that the first-peak throughput criterion
is achieved and that the adaptations based on this criterion can be invoked.



[Figure: two plots, titled "Throughput vs Elapsed Time" (left) and "Transmission time of message vs Elapsed Time" (right)]

Fig. 4. Performance of NetPIPE with and without invoking the adaptation mechanism:
bandwidth measurements between H1 and H3 (left) and time to transmit a message
(right)



4 Concluding remarks 

We have outlined a design of Network Information Collection and Application
Notification (NICAN) that emphasizes simplicity of use, modularity, and a callback
application notification mechanism. The tool can estimate the required
network parameters either by polling existing network management software
or by benchmarking the network connection. Selective notification of an application
is implemented using a callback mechanism, which makes it possible to
pass the required criteria as parameters when the tool is initialized within a given
application. Our experiments performed with a tool prototype already show the
potential of NICAN. In the future, we plan to conduct extensive experiments on
the tool-application interaction, provide a more sophisticated network parameter
analysis, and focus on the support for heterogeneous computing platforms.

Abstract. Array-OL, developed by Thomson Marconi Sonar, is a programming
language dedicated to signal processing. An Array-OL program
specifies the dependencies between array elements produced and
consumed by tasks. In particular, temporal dependencies may be specified
by referencing elements that belong to an infinite dimension of an
array.

A basic compilation strategy for Array-OL on a workstation has been
defined. This basic compilation does not generate efficient
code for every Array-OL application, in particular for those defining infinite
arrays. We propose to transform such applications into hierarchical
Array-OL applications that can be compiled with the basic Array-OL
strategy. We introduce a formal representation of Array-OL applications,
which is a relation between points of $\mathbb{Z}^n$ spaces; code transformations
are applied at this level. In this paper we show how the transformation
process is used during the compilation phase of a representative
application.



1 Introduction 

Array-OL¹, developed by Thomson Marconi Sonar [5], is a programming language
dedicated to signal processing (SP). The SP application domain is characterized
by systematic, regular, and massively data-parallel computations.

Array-OL applications are edited in a graphical specification environment.
An Array-OL application specification is built in two stages: a global stage
describes the application through a directed graph where the nodes (tasks) exchange
arrays; a local stage details the calculations performed on the array elements
by each node. An Array-OL application directly expresses dependencies
between elements of arrays. In particular, temporal dependencies may be specified
by references to elements along an infinite dimension of an array. Section 2
presents the Array-OL language.

Array-OL compilation targets both dedicated embedded multi-processor
computers and workstations or clusters of workstations, the latter for the purpose
of simulating and debugging applications.

¹ Array-OL is a trademark of Thomson Marconi Sonar. It stands for Array Oriented
Language.

A basic strategy for compiling Array-OL on a workstation has been defined.
This compilation mechanism cannot handle all Array-OL applications. In
particular, the manipulation of arrays with an infinite dimension is impossible and
the use of large arrays is expensive. This compiler is presented in Section 3.

We propose tools to transform Array-OL applications. A given Array-OL
application program is rewritten so that the initial version of the compiler
is able to handle it (or, at least, to handle it more effectively).

These transformations work at the level of a formalism of relations between
points of $\mathbb{Z}^n$ spaces: the ODT². The representation of an Array-OL task by the
ODT defines the links/dependencies between the elements of the output arrays
of the task and those of the input arrays. The transformation principle is detailed
in Section 4. Section 5 illustrates the approach with an example. The ODT and
their manipulations are studied in Sections 6 and 7. Section 8 compares our
approach with others.

2 The Array-OL Language

We briefly introduce the main characteristics of the Array-OL language [5]. 
An Array-OL application is made up of a task hierarchy [7]. The tasks are 
themselves data-parallel: they handle arrays. 

An Array-OL application is successively expressed in two models. A first 
global model defines the task scheduling in the form of dependencies between 
tasks and arrays. A second local model details the elementary actions the tasks
perform on array elements.

2.1 Global Model 

The global model defines and names arrays and tasks. The arrays are used to 
organize the dependence graph of tasks on a level: each task takes its inputs 
from the defined arrays and produces one or more arrays. 

The task specification and the detail of the array element usage are hidden 
at this specification stage. 



Array: A Structure for Signal Processing. SP applications are organized
around a regular and potentially infinite stream of data. Array-OL captures
this stream in arrays with a possibly infinite dimension.

Some spatial dimensions of arrays used in SP correspond to sensors. Such 
sensors may be organized in a circle. Consequently, Array-OL array dimensions 
wrap around. 

2.2 Local Model 

For a given task, the local model specification details the operations and accesses 
to input and output arrays. 

² Opérateurs de Distribution de Tableau (Array Distribution Operators).

An Array-OL task links its output array elements to its input array elements.
The role of the task is to produce all the values of its output arrays.

These values are produced through patterns. A pattern is a subset of the
elements of one array. An output pattern is produced by applying the code
associated with the task to patterns of the input arrays. A task implementation
thus consists of an iteration construct whose iterations are independent.

Since Array-OL is restricted to the specification of SP applications, the 
shape of the patterns, the array tiling by the patterns, and the task code are 
dedicated to this domain: ad hoc specifications are proposed. 

Fitting: Pattern Definition. Patterns are arrays. Equidistant elements in a 
pattern are equidistant in the array. 

A pattern may be defined by an origin in the array and a set of vectors
(fitting vectors; one vector is associated with each dimension of the pattern). The
other points of the pattern are defined in the array by shifting the origin along
the fitting vectors as many times as required by the pattern size.
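
The fitting rule can be sketched as follows. This is only an illustration with hypothetical names: vectors are plain tuples of array coordinates, and wrap-around of array dimensions is ignored here.

```python
from itertools import product


def pattern_points(origin, fitting_vectors, pattern_shape):
    """Map each pattern index d to the array point it designates.

    fitting_vectors[k] is the array-space step for pattern dimension k;
    pattern_shape[k] is the number of elements along that dimension.
    """
    points = {}
    for d in product(*(range(size) for size in pattern_shape)):
        points[d] = tuple(
            o + sum(step * vector[axis] for step, vector in zip(d, fitting_vectors))
            for axis, o in enumerate(origin)
        )
    return points


# A one-dimensional pattern of 3 elements taken every 4th row of a 2-D array.
line = pattern_points(origin=(1, 0), fitting_vectors=[(4, 0)], pattern_shape=(3,))

# A 2 x 2 pattern whose second dimension skips every other column.
block = pattern_points(
    origin=(0, 0), fitting_vectors=[(1, 0), (0, 2)], pattern_shape=(2, 2)
)
```

Because the mapping is linear in the pattern index, elements that are equidistant in the pattern land on equidistant points of the array.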

Paving: Tiling of an Array with Its Patterns. Two equidistant output 
patterns are produced by two equidistant input patterns. 

The array paving with patterns is given by a first pattern in each array and 
a set of paving vectors. The other patterns are defined by a shift of the initial 
pattern along the paving vectors as much as needed in order to cover the master 
array. By definition, two patterns of an output array may not overlap. 
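
A small sketch may help to picture paving and the non-overlap condition on output patterns. The helper names are hypothetical, arrays are treated as finite and without wrap-around, and the overlap test simply enumerates points; it is not meant to reflect how an Array-OL tool checks this property.

```python
from itertools import product


def tile_points(first_origin, paving_vectors, paving_shape,
                fitting_vectors, pattern_shape):
    """Map each paving index q to the list of array points of its pattern."""
    axes = range(len(first_origin))
    tiles = {}
    for q in product(*(range(n) for n in paving_shape)):
        # Shift the first pattern's origin along the paving vectors.
        origin = [
            first_origin[a] + sum(k * v[a] for k, v in zip(q, paving_vectors))
            for a in axes
        ]
        # Enumerate the pattern from that origin along the fitting vectors.
        tiles[q] = [
            tuple(origin[a] + sum(k * f[a] for k, f in zip(d, fitting_vectors))
                  for a in axes)
            for d in product(*(range(n) for n in pattern_shape))
        ]
    return tiles


def patterns_overlap(tiles):
    """True if some array point belongs to two different patterns."""
    seen = set()
    for points in tiles.values():
        current = set(points)
        if seen & current:
            return True
        seen |= current
    return False


# Four patterns of 2 contiguous elements paving an 8-element array.
valid = tile_points((0,), [(2,)], (4,), [(1,)], (2,))
# Same patterns with a paving step of 1: consecutive patterns share a point.
invalid = tile_points((0,), [(1,)], (4,), [(1,)], (2,))
```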

Component Library. For each paving iteration, a task extracts the input
patterns from the input arrays and applies a function to these patterns to produce
output patterns. These patterns are then stored in the output arrays.

The task is either a new hierarchy of tasks or an elementary transformation
(ET). A library of predefined ETs is available for usual signal operations (FFT,
integration...). An ET takes patterns as input and returns patterns; it may be
parametrized, for example by the size of the patterns.

2.3 Array-OL Specification Language 

Array-OL is a specification language. The programmer specifies dependencies 
in both models. In the global model, the dependencies between tasks are given 
by the input and output arrays. In the local model, the dependencies are given
in terms of patterns.

In this context, the compiler starts directly from these dependencies to gen- 
erate code. 

3 The aol2c++ Compiler 

The aol2c++ compiler is used to produce C++ code in order to execute an
Array-OL application on a workstation by straightforward translation. This
strategy limits the set of Array-OL applications the compiler may handle. The
Array-OL code transformation presented in Section 4 will allow us to widen
the set of Array-OL applications covered by the compiler.

3.1 Execution Scheme of an Array-OL Application 

The simulation of an Array-OL application on a workstation reads the input 
arrays and produces the output arrays in the file system. The intermediate arrays 
are allocated in memory. 

Infinite arrays are handled by slices of n values. The value of n is supplied
on the command line at execution time.

The simulation triggers the Array-OL tasks in an order computed by a
dependence analysis. An Array-OL task is an indivisible execution unit: it is
fully executed before the next task starts.

3.2 Main Structure of the Generated C++ Code 

A C++ function is generated for each Array-OL task. This function is para- 
metrized by the input and output patterns of the task and the possible param- 
eters of the task. 

An ET code is fetched from the component libraries. 

For a hierarchical task, the function locally defines and allocates the inter- 
mediate arrays. A dependence analysis produces a scheduling of the sub-tasks. 
The code of each sub-task consists of allocating the input and output patterns 
and iterating over the paving. The body of the loop: 

— copies the array points in the operand patterns; 

— calls the function corresponding to the task; 

— copies the output pattern points in the arrays. 

In particular, the generated code that manages the reading and writing of a pattern
takes a paving iteration vector and iterates on the fitting, using the following
formula to compute the index of an array point:

$$(M_p \cdot q + M_f \cdot d + O) \bmod m \qquad (1)$$

where q and d denote the paving and fitting iterators, $M_p$ and $M_f$ denote the paving
and fitting matrices, and m and O denote the array dimensions and origin.
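
Formula (1) can be read as: paving matrix times q, plus fitting matrix times d, plus the origin, all taken modulo the array dimensions. The sketch below evaluates it coordinate by coordinate. It is written in Python for brevity and is not the C++ code emitted by aol2c++; the function name is hypothetical, and representing an infinite dimension by `None` (no modulo applied) is an assumption of this example.

```python
def array_index(paving_matrix, q, fitting_matrix, d, origin, shape):
    """Evaluate (M_p.q + M_f.d + O) mod m for one array point.

    Matrices are lists of rows: row a holds the contribution of each
    iterator component to array coordinate a.
    """
    index = []
    for a, size in enumerate(shape):
        coord = origin[a]
        coord += sum(c * qi for c, qi in zip(paving_matrix[a], q))
        coord += sum(c * di for c, di in zip(fitting_matrix[a], d))
        # Assumption: None marks an infinite dimension, which never wraps.
        index.append(coord if size is None else coord % size)
    return tuple(index)


# A (1024 x infinite) array read with a stride of 4 along the first,
# cyclic dimension; q selects the pattern, d the element in the pattern.
point = array_index(
    paving_matrix=[[1, 0], [0, 1]],
    q=(1000, 3),
    fitting_matrix=[[4], [0]],
    d=(10,),
    origin=(0, 0),
    shape=(1024, None),
)
```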

3.3 Code Generation Key Points 

We detail some interesting points of the general compilation process. 



Array Scanning. An Array-OL application specification does not impose
an order on array element iterations (especially for multi-dimensional arrays).
The compiler must choose an order that is coherent with the allocation of the array in
virtual paged memory.

Static Shortcut of Pattern Copy. Input/output patterns of a hierarchical
task are only useful to read/write patterns of the sub-tasks (sub-patterns).
Therefore, no copy of the patterns is needed at this level. The sub-patterns are
built directly from the arrays: the sub-task receives references to the whole arrays
associated with the origins of the current patterns. At compile time, we
combine the two specifications of the paving/fitting of the task and sub-task to
produce a new paving/fitting specification.

In this context, the code of a sub-task is no longer independent of the calling task.
A sub-task used in different contexts is cloned for each of its uses.



Point Coordinate Computation. The computation of the coordinates of an
array point is based on equation (1). Nevertheless, the coordinates
of the set of points corresponding to a pattern are not computed by
a systematic application of the matrix product: an incremental computation is
implemented.

The coordinates of a point on a given iteration dimension are produced from
the coordinates of the previous point with the paving vector increments T'.
These vectors are computed, at compile time, from the fitting vectors T and the 
pattern sizes D: $T'_i = T_i - \sum_{j<i} D_j \times T_j$.



Modulo Usage. Array-OL arrays are wrapped around. A modulo operation 
is necessary on the coordinates of all points of an array. The cost of this modulo 
is prohibitive. 

A simple calculation [2] allows us to identify whether a set of points obtained
by Cartesian iterations of vectors causes array overflows or not. This restricts the
set of arrays for which a modulo operation is needed. The property is checked
at compile time for the whole array, and also at runtime for each pattern.
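
One way such a check could work is sketched below; it is not necessarily the calculation of [2]. Since each iterator ranges independently over its interval, the smallest and largest coordinate reached on each axis are obtained by adding up the negative and the positive contributions separately, without enumerating the points. The function name and argument layout are hypothetical.

```python
def needs_modulo(origin, vectors, counts, shape):
    """True if some point origin + sum(d_k * vectors[k]) leaves the array.

    Each d_k ranges over 0 .. counts[k] - 1 (counts are assumed >= 1).
    """
    for a, size in enumerate(shape):
        lowest = highest = origin[a]
        for vector, count in zip(vectors, counts):
            reach = (count - 1) * vector[a]  # farthest step along this vector
            if reach < 0:
                lowest += reach
            else:
                highest += reach
        if lowest < 0 or highest >= size:
            return True
    return False


# 64 elements with a stride of 4 in a dimension of size 1024.
inside = needs_modulo((0,), [(4,)], (64,), (1024,))
wrapping = needs_modulo((900,), [(4,)], (64,), (1024,))
```

When the check returns False for a pattern, the modulo in the index computation can be skipped for all of its points.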

3.4 Limitations and Extension of the Compiler 

The aol2c++ compiler has a number of limitations. We illustrate these limita- 
tions and introduce the strategy chosen to get rid of them. 



Task Unity and Infinity Handling. An Array-OL task is executed from 
beginning to end before another task begins. As a consequence, the operands of 
a task must be completed before the triggering of the task. 

The fact that a task waits for its arguments to be complete limits the use of infinite
arrays in Array-OL. Only Array-OL applications made up of a single
task may handle arrays with an infinite dimension.

The execution scheme is expensive; other execution strategies may be con- 
sidered: a “pipeline” execution will trigger a task on a part of its operands to 
produce a part of its results; these results may be used by another task to com- 
plete a part of its work... Such an execution does not necessitate a full allocation 
of the intermediate arrays. 
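
The idea of a pipelined execution can be illustrated with generators. This is a sketch of the general principle only, not of any Array-OL runtime: the two task bodies are arbitrary stand-ins and every name is hypothetical.

```python
from itertools import count, islice


def blocks(stream, size):
    """Cut a (possibly infinite) stream into consecutive blocks of `size`."""
    iterator = iter(stream)
    while True:
        block = list(islice(iterator, size))
        if len(block) < size:
            return  # an incomplete trailing block is dropped
        yield block


def first_task(input_blocks):
    """Stand-in for a first task: one output block per input block."""
    for block in input_blocks:
        yield [2 * x for x in block]


def second_task(input_blocks):
    """Stand-in for a second task: reduces each block to a single value."""
    for block in input_blocks:
        yield sum(block)


# The source is infinite, yet each stage holds only one block at a time,
# so no intermediate array is ever fully allocated.
pipeline = second_task(first_task(blocks(count(), 4)))
first_results = list(islice(pipeline, 3))
```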

Compiler Extension. In order to be able to deal with a wider set of Array-OL
applications (applications handling huge or infinite arrays), we propose to
keep the basic compiler strategy implemented in aol2c++ but to apply a preliminary
transformation to the Array-OL source.

This transformation step will produce, from a given Array-OL program, an 
Array-OL program that the aol2c++ compiler may effectively handle. 



Our Intermediate Language, Array-OL/aol2c++. The set of Array-OL
sources that aol2c++ may compile defines a subset of Array-OL. We use this
subset as an intermediate language in our code transformation/compilation process.
This approach offers numerous advantages.

First, we propose basic transformation operators of Array-OL code. The 
operators may be applied interactively in the frame of the Gaspard environ- 
ment [3]. The programmer may then evaluate the quality of the transformation 
(visualization of the memory size needed for the execution of a task, etc.). This 
interactive semi-automatic usage is an experimental platform that allows the 
definition of transformation strategies. 

Applications compiled by aol2c++, i.e. the code obtained
after the transformation process, consist of a main loop over time. The
body of this loop is itself made up of a linear loop that accesses array elements.
This is a good formalism for applications which aim for parallel execution.
Furthermore, the form of the code naturally identifies sequences of tasks which
produce a result (i.e. patterns) from an input pattern. This property may be
exploited to map the arrays on a distributed memory.

In particular, this formalism is used to implement Array-OL applications 
on a dedicated architecture developed by Thomson. 

Finally, this method allows a code transformation to change the scheduling of
an application without having to write a new implementation of the Array-OL
compiler.

4 Array-OL Code Transformation Principle 

We propose a transformation operator for an Array-OL application that introduces
supplementary hierarchy levels in the application. We consider a set of
operand patterns that are able to trigger a sequence of tasks, each one producing
a whole number of patterns. Operand arrays are then cut into macro-patterns:
macro-patterns are subsets of operand array elements allowing a task to produce
at least one output pattern.

New loops on macro-patterns ensure the processing of the whole operand 
arrays (Figure 1). The definition of these loops on the macro-patterns relies on 
an extension of the paving and fitting notions, namely the macro-paving and 
macro-fitting. 

Such a strategy has already been used at Thomson Marconi Sonar
on some applications in order to implement them on a dedicated architecture. We
automate the transformations.

Fig. 1. Rewriting of (a) an Array-OL application of three successive tasks into (b) a
hierarchical Array-OL task composed of three sub-tasks. Observe the possible reduction
of the intermediate array size. Rectangles denote arrays. They are cut into patterns
which are used and produced by tasks (denoted by arrow-like polygons)



The validity of the hierarchization relies on the dependencies: input macro- 
patterns of an iteration must contain enough points to allow the computation of 
the task sequence of the hierarchy and to produce the output macro-patterns. 

The ODT formalism is a representation of the dependencies between input 
and output of an Array-OL task. The principle is to code a set of Array-OL 
tasks with ODT, to transform these ODTs, and to find an ODT form of an 
Array-OL hierarchy. 

5 Array-OL Code Transformation Example 

As explained above, our main goal is to automatically compute a hierarchy from 
a set of tasks. We are going to illustrate our transformations on a representative 
example of signal processing. We will produce just one hierarchy from an initial 
sequence of two tasks. 

5.1 Beam Forming 

The application consists of providing frequencies and location correlations
(so-called beams) from a continuous flow of data. It is based on elementary signal
transformations: FFT (Fast Fourier Transform) and discrete integration.

— Hydrophones, an (h = 1024 × T = ∞) array, is the input of the application.
It delivers a continuous flow of data from a set of 1024 captors.

— The first task computes an FFT for each captor and period of 512 units of time.
It fills Frequencies, an (h = 1024 × T = ∞ × f = 256) array.

— The second task computes a beam for each period, frequency and set of
captors (one captor out of 4 in a cyclic linear range of 64). It outputs Beams,
an (t = 1024 × T = ∞ × f = 256) array.

The direct aol2c++ compilation produces code that cannot be run. The ap- 
plication specification must be transformed. 




Fig. 2. Sequence of the three arrays ((a) Hydrophones, (b) Frequencies, (c) Beams). 
One input/output pattern and the paving directions have been represented for each 
array. On Frequencies, the compact pattern is written and the sparse cyclic one is read 



5.2 Code Transformations 

Break of the Flow. As mentioned above, the temporal dimension should be 
broken into recurrences in order to link partial execution of the two tasks. In the 
example, the recurrence length on the Hydrophones temporal axis would be 512: 
it would compute all FFTs of a period. Each recurrence will then provide the 
full hydrophone and frequency dimensions of Frequencies for the given period. 
Eventually the second task will compute the whole beam dimension for the 
period value under consideration. 



Reduction of Temporary Data. The two sub-tasks generated from the initial 
tasks work now on finite arrays. Nevertheless, other cuttings could alleviate the 
memory requirements. 

The computation of a Beams pattern requires an input pattern on Frequencies.
Computing this pattern requires the 64 FFTs of the corresponding period
(one FFT per hydrophone). Thus, each iteration produces a single beam and
needs only an (h = 64 × f = 256) intermediate array.



Reduction of Redundancies. The major problem of the previous solution
comes from redundant computations. First of all, the first sub-task produces
the full frequency dimension of Frequencies, which implies that the second sub-task
may compute the beams for all frequencies. On the other hand, some input
patterns of Frequencies overlap along the hydrophone dimension. For the whole application,
the same FFT is recomputed 63 times. To avoid this, all overlapping patterns
are gathered in a single pattern. Each period is divided into 4 input/output
meta-patterns: the input patterns are the sets of 256 hydrophones (one in 4); the
output patterns are the sets of 256 beams (one in 4) with the whole frequency
dimension. The Frequencies sub-array stores the results of the 256 FFTs.

In this scheme, there are no redundancies and the intermediate array size
is (h = 256 × f = 256). If this size fits within the memory resources, this
transformed application represents the best balance between memory usage and
computation overhead.
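
The figures quoted in this example can be checked with a few lines of arithmetic. The sketch assumes one particular reading of the access pattern: the beam at position b reads the 64 hydrophones b, b + 4, b + 8, ... taken cyclically among the 1024. The constant and function names are hypothetical.

```python
from collections import Counter

HYDROPHONES = 1024  # size of the cyclic hydrophone dimension
STRIDE = 4          # one hydrophone out of 4
PER_BEAM = 64       # hydrophones read by one beam (assumed reading)
FREQUENCIES = 256   # size of the frequency dimension


def beam_inputs(beam):
    """Hydrophones read by one beam under the assumed access pattern."""
    return [(beam + STRIDE * j) % HYDROPHONES for j in range(PER_BEAM)]


# Number of beams that read the FFT of each hydrophone, for one period.
uses = Counter(h for beam in range(HYDROPHONES) for h in beam_inputs(beam))

# Beams whose inputs may overlap share the same residue modulo STRIDE;
# gather the hydrophones read by each such group of beams.
groups = {}
for beam in range(HYDROPHONES):
    groups.setdefault(beam % STRIDE, set()).update(beam_inputs(beam))

per_beam_cells = PER_BEAM * FREQUENCIES       # one beam per iteration
grouped_cells = len(groups[0]) * FREQUENCIES  # one meta-pattern per iteration
```

Under this reading every FFT is read by 64 beams, so producing one beam per iteration computes it 64 times (63 recomputations), while grouping yields 4 meta-patterns of 256 hydrophones each and a 256 × 256 intermediate array.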



6 ODT Representation of an Array-OL Task 



The ODT formalism allows the specification of dependencies between input and 
output operands of Array-OL tasks [4,6]. 

The set of the ODT is built by composition of basic operators. Each operator
defines a mathematical relation between two $\mathbb{Z}^n$ spaces. These relation operators
act like filters which cut, let through, or duplicate input links. An input point
may be mapped to zero, one, or several (and even an infinity of) output points.
Relation operators are presented in Table 1.



Table 1. Basic relation operators of the ODT

The relation operators are close to Array-OL characteristics: the gauge
defines the array boundaries; the projection combines paving and fitting vector
sets; the shift comes from the shifting of the origin; the modulo is used for toroidal
array dimensions.

The relation operators of replication and segmentation are respectively the
symmetrics of modulo and projection. (The gauge and the shift are their own
symmetric.)

The ODT representation of an Array-OL task consists of two expressions
that represent the links between the iteration space and, on one side, the operand
arrays and, on the other side, the resultant arrays. Each of these expressions is the
composition of a gauge that limits the paving and fitting iterations, a projection
made of the paving/fitting vectors, a modulo on the array sizes, and a possible
shift depending on array origins. For example, the ODT of the first task of the
beam forming (Section 5) is:

The two iteration spaces are homogenized (by gauge normalization and introduction
of zeros in the projection matrix):

It is then possible to compose the operand ODT with the symmetric of the
result ODT. The resulting expression is just a link from the output points to the
input points:

The general ODT form of an Array-OL task is:

7 Composition of Two Array-OL Tasks

Consider two consecutive tasks T1 and T2: T1 produces an array A2 from an
array A1, T2 produces A3 from A2. The ODT of the tasks are:




A hierarchization consists in merging the two tasks, in order to find a task
T12 that directly produces A3 from A1. This task will be composed of two sub-tasks
T'1 and T'2, transformations of the original tasks. The ODT of T12 is the
composition of the T1 and T2 ODT:




The hierarchization process will transform this expression into an ODT form
of an Array-OL task (2). The transformation is detailed in [2].

The outline of the transformation is to produce a symmetric segmentation
Vres,1 and to join the result with the following projection Vop,2. This is a
legal transformation because of the constraints and limitations of Array-OL.
Nevertheless, the operation may produce non-integer values. Therefore, we gather
several patterns and consider a bounding box to retrieve an integer form.

We compose the resulting form with the original paving/fitting matrices Top,1
and Ares,2. The paving and fitting parts of this iterator become the paving and
fitting of the task T12. The fitting part is also split in two parts to form the
paving of the two sub-tasks T'1 and T'2.

8 Related Work 

The Array-OL language and the ODT formalism belong to linear algebra, 
integer programming and constraint systems. Tools other than the ODT may be 
used in that context: 

— An Alpha [12] application is defined by a system of affine recurrence equations
(SARE). To implement such applications on systolic architectures, interactive
transformations are considered such as changes of basis and toggling
between broadcast and pipeline. The system may also generate a scheduling
and an allocation of arrays.

— PIPS [11] is a Fortran 77 automatic parallelizer. It includes dependence
analysis, code transformations and SPMD code generation.

Both of the above use a formalism based on polyhedra. Indeed, since a polyhedron
defines a region of space by bounding it with a set of affine hyperplanes, it
provides a reasonable representation of the iteration set of a loop nest.

Several software packages handle polyhedra. The ones that are useful for
compilation have to handle parametrized polyhedra defined by integer constraints.
The PolyLib [13] handles parametrized rational polyhedra (image by
affine function, convex hull, integer points count...). PIPS, Alpha and others
rely on this library. On the other hand, PiP [8] solves parametrized integer programming
problems; it is used by several automatic parallelizers such as Paf [9],
Bouclette [1] and SuiF [10].

Although the ODT are less expressive than polyhedra (with the exception
of the notion of modulo), they are sufficient to formalize the Array-OL
language. Moreover, our transformation process produces Array-OL source.
Restricting ourselves to a formalism closer to Array-OL simplifies the finalization of
the transformations.

9 Conclusion 

Array-OL is a parallel language dedicated to signal processing. A code trans- 
formation strategy is implemented to overcome the limitations of a first basic 
compiler (inability to handle infinite arrays, poor performance on huge arrays). 

We have proposed a formalism to represent Array-OL applications. In this 
formalism, we defined a basic transformation operator of an Array-OL code 
into a hierarchical Array-OL code. This transformation has been implemented 
in Gaspard, a graphical environment for Array-OL application specification. 
It allows us to interactively transform Array-OL tasks. 

Several representative applications have already been transformed. Signifi- 
cant gains have been reported. 

From these experiments, we are developing strategies to automatically apply 
this operator in the rewriting of a whole Array-OL application. 

The proposed code transformations produce an Array-OL hierarchy of
tasks. The iterations of this hierarchy are independent: we have found a number
of independent flows equal to the number of patterns in the array. We are developing
such code transformations to control the mapping of arrays on a given
number of threads/processes.

Abstract. Principles for coordination and composition of parallel/distributed
programs are discussed. We advocate a synchronizing shared memory model 
(EDA) for coordination and an algebraic approach to building programs using a 
linking language (LL) based on module composition, restriction and renaming. 
A prototype system ErlEda illustrating these principles is described. The system 
uses the concurrent programming language Erlang and its distributed 
environment as a basis. We illustrate the approach using the Dirichlet problem. 



Introduction 

An increasing number of computation intensive applications require the power of 
parallel computers, and a variety of distributed applications already make a large 
impact on modern society. However the complexity of programming parallel/ 
distributed systems is a significant obstacle, making development of new applications 
for parallel/distributed computer systems costly and error-prone. This motivates the
search for new concepts and platforms that would enable parallel/distributed programs
to be conveniently designed from sequential components without sacrificing either 
efficiency or robustness. In this work we propose and demonstrate the use of new 
principles for the composition and coordination of parallel/distributed programs. In 
this paper we concentrate on software engineering and execution efficiency aspects. 

We consider a programming model for parallel and distributed computing that 
combines aspects of the shared memory programming model and the object-oriented 
paradigm. The shared memory model defines a shared address space that can be 
accessed by processes via ordinary loads and stores, thus providing convenient but 
unstructured communication between processes. To maintain desired ordering in 
parallel execution and preserve essential data and control dependencies in a parallel 
program, synchronization such as locks, flags and barriers must be provided. 

The object-oriented paradigm states that an object can carry a thread of execution 
and that the object's state is encapsulated in variables that are local to the object even 
though some may point to other objects. The state of the object can be observed and 
changed by some external thread only via message passing. Object orientation 
supports structured design and efficient software development. Achieving high 
execution efficiency for such programming models in a parallel/distributed
environment is, however, a real challenge.

A basic idea of the model introduced in this article is to unify some local variables 
of interacting objects to be shared among them. Such variables constitute a state 
common to these objects. The variables are accessed via accessor functions that
synchronize implicitly according to the type of the variable. The linking of objects via
shared variables allows efficient combination of fine-grain communication and
synchronization among object threads similar to dataflow execution.



Coordination: The EDA Model 

Assuming that a parallel program contains a number of cooperating entities or
processes, the question arises how these entities coordinate their work. A number of
models have been proposed, including various message-passing schemes and shared
memory approaches. The Linda tuple space model [4] offers another approach. The
term “coordination language/model” was coined in connection with Linda [5].

We propose using the EDA model [14, 10, 13] that provides a unified approach to 
shared memory, synchronization and communication, in other words, to coordination. 
Shared data is provided but can be accessed only in a constrained way. There is no 
global shared address space available to the processes; each process can access certain
shared variables, or acquaintances [1], on a “need to know” basis. 

Following [10, 13], shared variables are of three different synchronization types, I-
data, X-data, and S-data, which all impose different constraints in the way accesses
may be performed. I-data are used for enforcing data dependency: a read operation on
an empty variable will lead to suspension, and assignment is allowed only once. The
concept of I-data is inspired by, but not identical to, I-structures [3]. A write
operation on a full variable is discarded, thus supporting a kind of OR-parallelism.

X-data are used for mutual exclusion and synchronous communication: reads and 
writes must be performed in a strictly alternating sequence. A process attempting an
access violating this order will be delayed until another process has changed the state
of the accessed variable. S-data allow stream communication. A writing process can 
assign successive values to an S-type variable; these will be queued and available for 
read operations. Each read removes one value from the stream. Processes accessing a 
stream need not be suspended, except in the case of reading from an empty stream. 
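
The three synchronization types can be illustrated with a small thread-based sketch. It is written in Python and is not the ErlEda implementation; the class names IVar, XVar and SVar are hypothetical, and the sketch assumes that an X-data variable starts out empty, so that the first access must be a store.

```python
import threading
from collections import deque


class IVar:
    """I-data: single assignment; a fetch blocks until a value is present."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def istore(self, value):
        with self._cond:
            if self._full:
                return False  # a write on a full variable is discarded
            self._value, self._full = value, True
            self._cond.notify_all()
            return True

    def ifetch(self):
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            return self._value  # the value stays available to later readers


class XVar:
    """X-data: stores and fetches must strictly alternate."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False  # assumption: the variable starts empty
        self._value = None

    def xstore(self, value):
        with self._cond:
            self._cond.wait_for(lambda: not self._full)  # delayed while full
            self._value, self._full = value, True
            self._cond.notify_all()

    def xfetch(self):
        with self._cond:
            self._cond.wait_for(lambda: self._full)  # delayed while empty
            value, self._value, self._full = self._value, None, False
            self._cond.notify_all()
            return value


class SVar:
    """S-data: a stream; stores never block, a fetch blocks only when empty."""

    def __init__(self):
        self._cond = threading.Condition()
        self._queue = deque()

    def sstore(self, value):
        with self._cond:
            self._queue.append(value)
            self._cond.notify()

    def sfetch(self):
        with self._cond:
            self._cond.wait_for(lambda: len(self._queue) > 0)
            return self._queue.popleft()  # each read removes one value
```

In this sketch a second istore on the same IVar returns False without changing the value, which mirrors the discarded write mentioned above.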

A slightly more general model is achieved if the type restriction is removed, i.e. if
all access operations are allowed on a shared variable. For instance, a producer of a 
stream can conveniently achieve flow control by interspersing Sstores with Xstores at 
suitable intervals. This approach was taken in the mEDA model [12] where in 
addition Sfetches from an empty stream are non-blocking (returning the value 
‘empty’), and operations Ufetch/Ustore have been added for unsynchronized accesses. 

Like Linda, EDA provides a medium for coordination between processes that do 
not know of each other. This gives a clean separation between the executing
components of a program and promotes reuse. Efficiency is an issue of concern when
using Linda, since reading or extracting data requires a matching process. As noted
e.g. in [5], the open nature of the communication medium compromises security. Both
these drawbacks are absent with EDA. Compared to conventional shared memory, the
introduction of different synchronization types and/or operations enables more
efficient implementations, avoiding or reducing memory coherence overhead which is
a heavy burden in large-scale shared memory systems with conventional semantics.
Composition

In this section we discuss the issue of how to compose a program. We advocate an 
algebraic approach where program modules may be combined to form new modules. 

A module is a set of processes and shared variables. Some of the variables are 
exported i.e. their names and associated type information are visible to the outside, 
whereas the remaining variables are hidden. The module may also have imports, i.e. 
shared variables that are referenced by processes in the module but are not contained 
in it. Two modules A and B may be combined to a new module if their exports are 
disjoint, i.e. have distinctly named variables. Some imports required by A may be 
satisfied by B and vice versa. An executable program is a module without imports. 

To formalize, let us call the sets of exports and imports of a module M exps(M)
and imps(M), respectively. The combination of modules A and B is denoted A | B.
Then

• A | B is defined if exps(A) ∩ exps(B) = { } (the empty set)

• exps(A | B) = exps(A) ∪ exps(B)

• imps(A | B) = (imps(A) \ exps(B)) ∪ (imps(B) \ exps(A))

The combination A | B can be seen as the juxtaposition of an instance of A and an
instance of B, and the connection of identically named imports and exports.

Combination is the main operation of our language LL (linking language) for 
composing programs [11]. Other operations are restriction and renaming. The 
language can be seen as a form of process algebra restricted to static operations [7, 8]. 

Let A denote a module and s a set of names. The restriction of A by s is denoted
A\s and is the same module as A except that variables whose names are included in s 
are no longer exported. This operation is essential for information hiding. 

Let A denote a module and I a one-to-one mapping of names to names. We write
A[I] to denote a module equal to A but whose imported or exported variables have 
been renamed using I. 
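
The three operations can be sketched on the interface of a module alone. The following Python fragment is an illustration under simplifying assumptions, not part of ErlEda: a module is reduced to its export and import name sets (processes and hidden variables are ignored), and the names Module, combine, restrict, rename and is_executable are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    exports: frozenset
    imports: frozenset


def combine(a, b):
    """A | B: defined only when the export sets are disjoint."""
    if a.exports & b.exports:
        raise ValueError("exports must be disjoint")
    return Module(
        exports=a.exports | b.exports,
        imports=(a.imports - b.exports) | (b.imports - a.exports),
    )


def restrict(a, names):
    """A \\ s: names in s are no longer exported."""
    return Module(exports=a.exports - frozenset(names), imports=a.imports)


def rename(a, mapping):
    """Rename imported and exported names; the mapping must be one-to-one."""
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("mapping must be one-to-one")

    def f(name):
        return mapping.get(name, name)  # unmapped names are left unchanged

    return Module(frozenset(map(f, a.exports)), frozenset(map(f, a.imports)))


def is_executable(a):
    """An executable program is a module without imports."""
    return not a.imports


# Usage: connect a producer's export to a consumer's import by renaming.
producer = Module(frozenset({"out"}), frozenset())
consumer = Module(frozenset({"done"}), frozenset({"data"}))
program = combine(rename(producer, {"out": "data"}), consumer)
```

Here program exports data and done and has no remaining imports, so it is executable in the sense defined above; restricting it by {"data"} would hide the connecting variable.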

The linking language LL allows hierarchic composition of modules and facilitates 
reuse of software. Primitive modules are modules not composed of other modules. 
Given a set of primitive modules for an application area, different application 
programs can be built using LL by combining instances of these modules. 



A Prototype System 

This section describes a prototype system ErlEda for parallel program development, 
based on the principles described above. Our description concentrates on the 
coordination and composition facilities and their implementation. The prototype is 
based on the Erlang programming language and system developed at Ericsson [2]. 
Erlang is a programming language designed for building real-time and distributed
applications. It has inherited many properties from functional programming
languages. It uses dynamic typing and is
garbage-collected. It features lists and tuples as built-in data types. A freeware version
of the system is available on the Internet (http://www.erlang.org/). Erlang offers
a process concept and means of communication and synchronization. Scheduling is
provided for free. Processes can be spawned at run-time. Processes communicate
using asynchronous message passing. There are no global data; only process
identifiers (pids) are global. There is support for error handling and the creation of 
robust programs. Erlang supports programs distributed over several hosts (nodes). 
This feature is used in the implementation of ErlEda. 



Distribution 

The processes and shared variables of an ErlEda program are distributed over a set of 
nodes in a logically transparent way. Logical transparency means that the semantics 
of a program will not be affected by how its objects and shared variables are 
distributed. This property facilitates programming, though the distribution will affect 
performance. Our approach is to leave it to the user to specify, for each node, which 
EDA processes and shared variables should be allocated to it. With present 
technology it is not very important to consider network topology, thus we assume the 
target system to consist of a set of equal and fully connected nodes. The user specifies 
N LL expressions if the program is to be executed on N nodes. An option would be to 
use automatic allocation of the program, based on its total LL description possibly 
together with some information on the frequency of shared variable accesses.

Dynamic load balancing by means of re-allocation of processes and shared 
variables during execution is less useful since it would mean discarding our 
knowledge of the program's static structure provided by the LL description. Also, in 
most cases the communication overhead would be too high. 



Processes and Shared Variables 

Erlang processes represent EDA processes, henceforth called objects, and the 
behavior of the object is specified by an Erlang function. Some of the arguments of 
the function represent the acquaintances of the object, while others constitute its local, 
mutable state. An ErlEda program has an essentially fixed configuration of EDA 
objects interrelated by means of acquaintances to shared variables. Dynamic evolution 
of the configuration is possible under certain conditions, but this feature is not 
relevant for our discussion. 

The Erlang module called de (short for distributed eda) implements the access 
operations on shared data. For example, I-fetching the value of shared variable V to
local variable X is expressed by X = de:ifetch(V). In the following we omit the
“de:” prefix for the shared data operations.

In all there are six main access operations: ifetch, istore, xfetch, xstore,
sfetch, and sstore. As Erlang is a typeless language as far as data types are
concerned, so is ErlEda. A shared variable may be bound to a number, an atom, a 
tuple or a list. It may also be empty (unbound). If V in the example above is empty 
then the executing object will be suspended until another object assigns a value to it. 

The EDA shared memory is managed by the set of EDA daemon processes, one 
per node. Remote I-data variables fetched by an object are cached locally for future 
references by objects in the same node. 
The Linking Language

A program distributed over N nodes is represented in the linking language by a list of 
N components, each of which specifies a module allocated to one node. An ErlEda 
module contains objects and shared variables. The module is represented by an Erlang 
tuple {ShVars, Objects, Exports, Imports}, where ShVars enumerates the shared 
variables, Objects the arguments and function names of the objects. Exports specifies 
which of the shared variables should be visible outside the module, and Imports lists 
the remote references needed by this module. The loader program will check that the
modules in the list are consistent and that all imports demanded by one node are
satisfied by another node. If this is the case, the remote references will be resolved
and the N node program will be constructed and started. Note that this internode 
linking amounts to a final combination operation with N operands. 

The linker and loader are themselves programmed in Erlang and constitute an 
Erlang module ll. Table 1 summarizes its main functions. The linking language LL
in our system consists of these functions embedded within Erlang. 



Table 1. LL functions of the ErlEda system

LL function              Effect
-----------------------  -------------------------------------------------------
build(V, O, E)           Builds a module with shared variables V, objects O, and
                         exported shared variables E
combine(M1, M2)          Builds the module M1 | M2
comblist(L)              Builds the combination of the modules of list L
restrict(M, S)           Builds the module M \ S
rename(M, Old, New)      Builds a version of M where name Old is replaced by New
rename(M, OldNewList)    As above but performs simultaneous replacements
                         according to OldNewList, a list of pairs
bind(M, Arg, Val),       Builds a version of M where imported name(s) Arg is
bind(M, ArgValList)      bound to value(s) Val
cat(A, N)                Forms a new name by concatenating A and N. N is
                         typically a natural number.



An Example: The Dirichlet Problem 

In this section we show how a typical computational problem can be described, 
partitioned and executed on a set of m nodes (workstations). 

Given is a 2D grid of numerical values, where the boundary points have constant
values and the interior points at time (t+1) are determined as the average of the four
neighboring points at time t. We assume n^2 interior points and 4×(n+1) boundary
points, making a total of (n+2)^2 points. All interior points are zeros initially, and the
constant boundary value is chosen as 100.

The problem will here be partitioned column-wise, so that the program will be 
composed of a left boundary column, a right boundary column, and m partitions each 
containing w interior columns, where m×w equals n. Assume m ≥ 2 and n/m ≥ 2. Each
partition will iterate over its part of the grid. For each iteration the partition will have 
to exchange its boundary columns with those of its neighbors. For simplicity, the 
number of iterations is fixed to 100. A partition will hold a local array where the
leftmost and rightmost columns are local copies of neighboring columns in other 
partitions. In addition, a partition will contain and export shared variables Lout, 
Rout, and import shared columns Lin, Rin. (Hint: “L” and “R” indicate direction of 
information flow - leftward or rightward). 

The result of the computation is collected and printed by a collector object. 

We show the code for a partition in outline: 

part(I1, W, N, Lin, Lout, Rin, Rout, Res) ->
    % Initial state: W columns of N zeros, each with a boundary value
    % of 100 added at both ends
    State = vec(W, addon(100, vec(N, 0), 100)),
    % Iterate Niter times:
    Final = iterate(100, State, Lin, Lout, Rin, Rout),
    % Send final result to collector object:
    send(Final, I1, Res), receive bye -> bye end.

% Store each column on the result stream, tagged with its column index:
send([H|T], I, S) -> sstore({I, H}, S), send(T, I+1, S);
send([], I, S) -> ok.

iterate(0, State, Lin, Lout, Rin, Rout) -> State;
iterate(Niter, State, Lin, Lout, Rin, Rout) ->
    % Exchange boundary columns with neighbours:
    xstore(hd(State), Lout), xstore(last(State), Rout),
    State1 = addon(xfetch(Rin), State, xfetch(Lin)),
    iterate(Niter-1, laplace(State1), Lin, Lout, Rin, Rout).

The function laplace computes a new state by averaging over neighboring points 
and is omitted here. Figure 1 outlines the partitioning of the program over the m 
nodes. Here the b boxes represent the left and right boundaries of the grid, the box
P(j) the j'th partition, and the C box the collector.
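
Since laplace is omitted, the following Python sketch shows one way such an averaging step could look. It is an assumption about the data layout rather than the authors' code: the state is taken to be a list of columns that already includes the two neighbour columns, and each column carries its constant top and bottom boundary values.

```python
def laplace(state1):
    """One averaging step over the interior columns of state1.

    state1 is a list of columns including the two neighbour columns at
    either end. Returns the updated columns without the neighbour columns.
    """
    new_state = []
    for c in range(1, len(state1) - 1):
        left, col, right = state1[c - 1], state1[c], state1[c + 1]
        new_col = [col[0]]  # top boundary value stays constant
        for r in range(1, len(col) - 1):
            # average of the four neighbouring points at the previous step
            new_col.append((left[r] + right[r] + col[r - 1] + col[r + 1]) / 4)
        new_col.append(col[-1])  # bottom boundary value stays constant
        new_state.append(new_col)
    return new_state
```

For a partition of two columns with n = 2 whose neighbours are both boundary columns of value 100, one step turns every interior zero into 50.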



Fig. 1. Partitioning the Dirichlet program over m nodes.



The module P(j) exports shared variables Lout, Rout and imports shared variables 
Lin, Rin. To connect these, global names are introduced. We use r0, r1, ..., rm for
the right-bound connections, and l0, l1, ..., lm for the left-bound ones. By renaming the
Lin variable of P(j) and the Lout variable of P(j+1) to lj, and the Rout variable of
P(j) and the Rin variable of P(j+1) to rj, the desired connections are established.
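
The naming scheme can be checked mechanically. The Python sketch below is illustrative only and uses hypothetical helper names: it builds the rename map for each partition and verifies that every global name ends up with exactly one exporter and one importer once the two boundary columns are included.

```python
def connection_names(j):
    """Rename map for partition P(j), following the scheme in the text."""
    return {
        "lin": f"l{j}",
        "lout": f"l{j - 1}",
        "rin": f"r{j - 1}",
        "rout": f"r{j}",
    }


def links_are_consistent(m):
    """True if every global name has exactly one exporter and one importer."""
    exporters, importers = {}, {}

    def add(table, name, who):
        table.setdefault(name, []).append(who)

    # Left boundary exports rout (as r0) and imports lin (as l0).
    add(exporters, "r0", "lboundary")
    add(importers, "l0", "lboundary")
    # Right boundary exports lout (as lm) and imports rin (as rm).
    add(exporters, f"l{m}", "rboundary")
    add(importers, f"r{m}", "rboundary")
    for j in range(1, m + 1):
        names = connection_names(j)
        add(exporters, names["lout"], f"P({j})")
        add(exporters, names["rout"], f"P({j})")
        add(importers, names["lin"], f"P({j})")
        add(importers, names["rin"], f"P({j})")
    all_names = set(exporters) | set(importers)
    return all(
        len(exporters.get(n, [])) == 1 and len(importers.get(n, [])) == 1
        for n in all_names
    )
```

For m = 2 this yields the names l0, l1, l2 and r0, r1, r2, each exported by one module and imported by its neighbour.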

Table 2 summarizes the modules from which complete programs are built. 

To create and allocate the program for n=16, m=2 we create a list with two 
components. Assume that variables Lboundary, Rboundary, ... have been bound to the 
corresponding module descriptions (Table 2). 
Table 2. Primitive modules for Dirichlet programs

Name       Sh. Vars   Objects                             Exports    Imports
---------  ---------  ----------------------------------  ---------  ------------------
lboundary  rout       boundcol(lin,rout,n)                rout       lin, n
rboundary  lout       boundcol(rin,lout,n)                lout       rin, n
partition  lout,rout  part(i1,w,n,lin,lout,rin,rout,res)  lout,rout  i1,w,n,lin,rin,res
collector  res        coll(res, n)                        res        n



Part1 = bind(Partition, [{i1,1}, {w,8}, {n,16}]),
Part2 = bind(Partition, [{i1,9}, {w,8}, {n,16}]),
[comblist(
    [rename(bind(Lboundary, n, 16), [{lin,l0}, {rout,r0}]),
     rename(Part1, [{lin,l1}, {lout,l0}, {rin,r0}, {rout,r1}]),
     bind(Collector, n, 16)]),
 combine(
    rename(Part2, [{lin,l2}, {lout,l1}, {rin,r1}, {rout,r2}]),
    rename(bind(Rboundary, n, 16), [{rin,r2}, {lout,l2}]))]

The following Erlang function "sys" composes such a system with arbitrary n and m
(n = w×m, m ≥ 2, w ≥ 2). Note the use of the LL function cat() to form new names and the
use of the Erlang function constructor “fun(X) -> expression(X) end” to express the
partitions P(i).

sys(M, N) ->
    W = N div M,
    P = fun(I) ->
            P2 = rename(bind(Partition, [{w, W}, {n, N}]),
                        [{lin, cat(l, I)}, {lout, cat(l, I-1)},
                         {rin, cat(r, I-1)}, {rout, cat(r, I)}]),
            bind(P2, i1, W*(I-1)+1)
        end,
    Parray = listof(P, 2, M-1),    % creates [P(2), ..., P(M-1)]
    append([
        [comblist([
            rename(bind(Lboundary, n, N), [{lin, l0}, {rout, r0}]),
            P(1), bind(Collector, n, N)])],
        Parray,
        % The right boundary connects to the last partition via rM and lM:
        [combine(P(M),
            rename(bind(Rboundary, n, N),
                   [{rin, cat(r, M)}, {lout, cat(l, M)}]))]]).

Given the resulting list of modules as argument, the system loader will create and 
allocate each module on a separate node. Then all inter-module references will be 
resolved, and the execution of the modules will be started. 



Conclusions 



We have shown how an environment for parallel/distributed programming can be 
based on the coordination medium of the EDA shared memory together with the 
linking language LL for composing programs. The resulting system is powerful and
easy to use. The different access types offered by EDA make the system more
general than the message-passing based systems common today, while avoiding the
overhead of general cache coherence. The algebraic approach of LL supports
hierarchic system design and reuse of software components and subsystems. 

Our prototype system is based on the Erlang language and system. Erlang offers a 
convenient platform for building distributed systems, but our principles can be 
accommodated in most environments. In [12] an implementation based on PVM [6] is
described, which, however, does not support the linking facilities. The performance of
any system based on our principles will mainly be determined by the efficiency of the 
communication mechanisms offered by the hardware and operating system. 

The concept of a linking language can also support resource binding before 
execution. When the static structure of the program is known before run-time the 
system can allocate resources more efficiently. For instance, the optimal allocation of
a given program on a given distributed system has been investigated in [9], based on a 
cost model accounting for computation, communication, and synchronization. 
Abstract. The vision of Computational Grids promises an exciting fu-
ture for the distributed simulation community. In this project we make
a small but practical step toward the grand vision of distributed simula-
tion by using certain prevailing Internet technologies to enable access of
simulation services anytime and anywhere. Specifically, this project fo-
cuses on accessing distributed simulation of AGVs (Automated Guided
Vehicles) in container port operations through the World Wide Web. The
objectives are to explore and address relevant issues, evaluate various
approaches, and demonstrate a workable version. We initially construct the
AGV simulation system in an indirect communication model and iden- 
tify its merits and demerits. Then, we explore the use of JINI technology 
for an efficient and robust direct communication architecture. 



1 Introduction 

The vision of the computational grid has been well expounded in the book by Ian
Foster and Carl Kesselman, The Grid: Blueprint for a New Computing
Infrastructure [6]; simply put, the grid is an infrastructure that provides dependable,
consistent, pervasive and inexpensive access to high-end computational capa- 
bilities. This will result in increased delivered computation by five orders of 
magnitude within a decade brought about by increased demand-driven access 
to computational power, increased utilization of idle capacity, greater sharing of 
computational results, and new problem solving techniques and tools. 

Meanwhile, in the past decade, the “Internet revolution” has been the most 
significant technological development. The technological development has crossed 
all the frontiers of time and space and truly reduced the world into a global vil- 
lage. Presently the development of the Internet appears to be driven by the mo- 
mentum created by the “network-centric model” [3], the ultimate goal of which is 
to turn the network into the computer and turn the client into a “thin” client, i.e. 
shifting computing burden from the client to server. It brings about the concept 
of balancing the computational burdens throughout the network amongst clients 
and servers with minimum resources expended on each specific client. Although 
arising from a different context, mainly driven by the rise of the PC and the Internet,
this network-centric computing concept actually coincides with that of the
Computational Grids, which has been championed by people from the supercomputing
arena. In this project we make a small step toward the vision by using some of the
prevailing Internet technologies. Specifically, the project focuses on Web-based
distributed simulation of AGVs by using JINI and Java. The objectives are to
explore and address relevant issues, evaluate various approaches, and demonstrate a
workable JINI-enabled version for a distributed AGV simulation system.

The rest of the paper is organized as follows: Section 2 briefly introduces a tra-
ditional AGV simulation system and its Web-enabled counterpart; Section 3 
presents an ad-hoc 3-tier approach; Section 4 elaborates on an efficient and ro- 
bust JINI-enabled framework; Section 5 concludes the paper and discusses future 
work. 

2 Container Port Simulation 

To manage the complexities of the processes at the port, container operations 
include scheduling of the port operations, allocation of resources and various 
traffic control schemes. An AGV simulation system concentrates on all or part 
of the above aspects. The statistics collected in the simulation would give use- 
ful information to both the route layout designers and the routing algorithm 
designers for the AGVs. 

We have developed a prototype AGV simulation system [8]. It runs on
SGI and Sun SPARC workstations and on other UNIX systems.

The user can specify the number of AGVs involved and various other control
parameters at the beginning of a simulation run. Afterwards, the user can
observe the visualization output of simulation execution. In the end, results are
analyzed and reports are presented to the user. Usually, the user will invoke
the simulation system with different parameters and repeat the run many times
before coming to a conclusion.

We now port the original system to the Web and Internet by utilizing the latest
Internet technologies such as JAVA and JINI in the following sections. 

We employ the “fat server” approach, which means that the critical computing is
done on the server, usually a high performance machine, so as to tackle the com- 
plicated computing involved in a simulation in time. A Java “wrapper” program 
is developed for our legacy non-Java AGV simulation service. The GUI part in- 
clusive of both input and output handling is ported onto the Web-Browser by 
means of Java applets. In our framework, the client is the Web-Browser, or more 
specifically the applet embedded in an HTML page; the server is a Java-coded or 
a JINI-enabled simulation service. Figure 1 is a snapshot of visualization outputs 
of a running AGV simulation on a Web-Browser. 

3 An Ad-Hoc Approach 

To support Web-based applications, it has become a rather standard practice to 
adopt a 3-tier architecture (see Fig. 2) [5].

We developed an initial ad-hoc version based on a generic approach, where the 
Middle-tier servers mediate between sophisticated back-end services and the Web 
front-ends. This approach applies an indirect communication model. Namely,
the clients will talk with the back-end services through a so-called “bridge”.
This model has both merits and demerits. Clearly, the biggest problem is that
the “bridge” may become a bottleneck in the system as the number of links
increases.

The user downloads Java applets from the Web server. The applets will then 
connect the client machine to a Lookup Service and send requests to the server. 
In response to a service request, the Lookup Service locates and forwards requests 
to the relevant AGV simulation service provider. 

On the high performance computer side, an Application Server Daemon is 
responsible for registering itself in a Lookup Service and waiting for incoming 
requests. Upon receiving a request, the Daemon will invoke the high performance 
computing application with the given parameters. Finally, the results will be 
transferred back to the clients via the Lookup Service server.

This version was developed by using the JDK (Java Development Kit) with TCP/IP
as its communication protocol. Besides the shortcomings associated with the
indirect communication model, there are other drawbacks. The system has to
deal with the joining and leaving of any particular AGV service provider.
The most difficult part is to handle all kinds of faults or exceptions, such as a
sudden crash of a service provider or the failure of a subnetwork.

4 A Structured Approach: JINI-Enabled Framework 

JINI is a framework for building scalable, robust and truly distributed systems 
using Java [2,4]. Using JINI is a new approach to demonstrate the concept of
a Web-enabled AGV simulation system. The JINI approach provides a number of
benefits including instant availability of services, impromptu community software 
and high flexibility and fault tolerance. 

By employing Java and JINI technology, we have designed, analyzed, and
implemented a JINI-enabled framework for a Web-based distributed AGV simulation
system.

The JINI-enabled version adopts a direct communication model, where the clients
will communicate with back-end services directly through a proxy provided by 
the service. A proxy is an arbitrary serializable Java object in the service item. 
It contains the information of how to interact with the service. 

The advantage of the direct scheme is obvious: firstly, there is no more
“bridge”-like bottleneck; secondly, the communication delay will be reduced
greatly [1]. A possible disadvantage of the direct communication model is re- 
lated to the security issue. Malicious attacks to the back-end server are possible 
because the server’s network address is published in its proxy. We here assume 
that the back-end machine itself has been well protected from intrusion. 

4.1 System Architecture: Services Federation 

Figure 3 shows a schematic of a so-called AGV simulation service federation
which consists of one or more JINI Lookup Services. 

For AGV Service Providers (see Fig. 3): 
— register: The AGV service providers will join the federation by registering
themselves in one or more JINI Lookup Services in the federation. This step
could occur at any time when a high performance machine starts up and is
willing to provide a service. The service program may be one that has been
installed on the machine or one downloaded from an AGV services repository.
If the service programs are downloaded from a services repository, they
should first be configured to be able to run on the local machine. That may
involve compiling, linking and other necessary processing. 

For Service Agents (see Fig. 3): 

1. Step (1): Firstly, the Service Agent should register itself in one or more 
Lookup Services so as to expose itself to clients and to be ready for serving. 
Meanwhile, it registers its interest in AGV Service in the federation.

2. Step (2): Then, the agent will collect matched AGV Service Providers’ prox- 
ies in the federation. When a new Service Provider comes to join the feder- 
ation or a Service Provider leaves the federation, the JINI Lookup Services 
will notify the agent of the change. In this way, the agent could always keep 
a collection of all live Service Providers’ proxies without polling individuals 
in the federation. 
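
The notification-driven bookkeeping in Step (2) can be sketched as follows. This is a toy Python model, not the JINI API: LookupService and ServiceAgent are hypothetical stand-ins, and proxies are represented by plain values.

```python
class LookupService:
    """Toy stand-in for a lookup service that notifies subscribers."""

    def __init__(self):
        self._providers = {}
        self._listeners = []

    def subscribe(self, listener):
        self._listeners.append(listener)
        # Report providers that registered before the subscription.
        for provider_id, proxy in self._providers.items():
            listener("joined", provider_id, proxy)

    def register(self, provider_id, proxy):
        self._providers[provider_id] = proxy
        self._notify("joined", provider_id, proxy)

    def unregister(self, provider_id):
        proxy = self._providers.pop(provider_id, None)
        if proxy is not None:
            self._notify("left", provider_id, proxy)

    def _notify(self, event, provider_id, proxy):
        for listener in self._listeners:
            listener(event, provider_id, proxy)


class ServiceAgent:
    """Keeps a collection of live provider proxies without polling."""

    def __init__(self, lookups):
        self.live = {}
        for lookup in lookups:
            lookup.subscribe(self._on_event)

    def _on_event(self, event, provider_id, proxy):
        if event == "joined":
            self.live[provider_id] = proxy
        elif event == "left":
            self.live.pop(provider_id, None)
```

The agent never queries individual providers; its view changes only when a lookup service reports a join or a leave, which is the property the text relies on.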

4.2 Distributed AGV Simulation Serving Scenario 

Figure 4 shows the scenario of how the Web-enabled distributed AGV simulation 
system serves a user. 

1. A user accesses the portal Web page of the AGV service, downloads Java 
applets and logs on to the federation.

2. The applet will then connect the user machine to a JINI Lookup Service and 
search for a Service Agent. When a Service Agent is located, the applet will
forward the user’s requirements to the Agent. 

3. After that, the Agent will utilize all available AGV simulation computing 
resources to execute AGV simulation tasks according to the user’s require- 
ments. Multiple independent modules or multiple batch jobs will be per- 
formed concurrently. 

4. The clients could monitor the execution of an individual task and observe 
its visualization output through the back-end server’s proxy. 

On the high performance computer side, upon receiving a request, a Service 
Provider Daemon will invoke the high performance computing application with 
the given parameters from the agents and finally the results will be transferred 
back to the requester. 

Obviously, the “batch work” feature of simulation executions makes distributed
multiprocessing possible. With the enforced system, we are now able
to apply the agent approach to the following two categories of AGV simulation
scenarios.

— Iterative simulation and batch jobs. As we mentioned earlier, a simulation
run is usually repeated many times, with the same parameters or with different
options, before reaching a more objective result. In both cases, the agent system
can dynamically distribute “batch tasks” among the service providers
in the federation and thus provide clients with a timely service.
Fig. 1. A Sample Session in Automated Guided Vehicle Simulation

Fig. 3. An AGV Simulation Services Federation

Fig. 4. Service Access Protocol in the Agent-Based Framework

— Time-parallel simulation. This is an alternative to space-parallel decomposition
of the simulation. The key idea is to decompose the simulation along
the time dimension for multiprocessing [7]. Usually, the computations involved
in different intervals are independent. From another point of view, we have a batch of
jobs again.



4.3 Scheduling and Self-Healing System 

Clearly, scheduling is an important issue in our system. First of all, the agent
estimates a weight value for each service provider based on an estimation
function:

$W_{provider} = f(NumofProc, PeakPerf, Workload, Latency, Credit, \ldots)$

NumofProc represents the number of the service provider’s physical processors;
PeakPerf denotes its peak performance; Workload indicates its current
computing workload; and Latency means the communication delay. These parameters
are dynamic information.

Credit is additional information. Each Service Provider is tagged with a
Credit value, which is evaluated by the Agent depending on the number of failures
of the Service Provider in earlier scenarios. Important and nontrivial tasks are
always dispatched to a Service Provider with a high Credit value.
With JINI technology, a provider joining or leaving the federation can be
observed by the agent. The agent recalculates the weight values of the service
providers and refreshes the ordered list regularly. If any task is currently assigned
to a failed service provider, it is reassigned to another provider. Meanwhile,
any newly joined resource is noticed and harnessed immediately. In other
words, it is an adaptive and self-healing system.
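
The scheduling idea above can be illustrated with a minimal Python sketch. The paper does not give the form of the estimation function f, so the product formula in `weight`, the `Provider` fields, and the helper names `ranked` and `reassign_failed` are all hypothetical choices made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Provider:
    name: str
    num_proc: int      # NumofProc: number of physical processors
    peak_perf: float   # PeakPerf: peak performance (arbitrary units)
    workload: float    # Workload: fraction of capacity in use, 0.0 to 1.0
    latency: float     # Latency: communication delay in seconds
    credit: float      # Credit: 0.0 to 1.0, lowered after failures
    alive: bool = True


def weight(p: Provider) -> float:
    # Hypothetical estimation function f: more free capacity, lower latency
    # and higher credit all increase the weight.
    free_capacity = p.num_proc * p.peak_perf * (1.0 - p.workload)
    return free_capacity * p.credit / (1.0 + p.latency)


def ranked(providers):
    # Ordered list of live providers, best first.
    return sorted((p for p in providers if p.alive), key=weight, reverse=True)


def reassign_failed(assignments, providers):
    # Move every task held by a failed provider to the best live provider.
    live = ranked(providers)
    live_names = {p.name for p in live}
    result = {}
    for task, name in assignments.items():
        if name in live_names or not live:
            result[task] = name
        else:
            result[task] = live[0].name
    return result
```

Calling `reassign_failed` after each refresh of the ordered list mimics the self-healing behaviour described above: tasks on providers that left the federation move to the currently highest-weighted live provider.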

5 Conclusions and Future Work 

This project arises from the need to bridge the gap between the supercomputers 
and the general users by utilizing some of the prevailing Internet technologies. 
Through a JINI-enabled distributed AGV simulation services community, the
remote client can access the AGV simulation resources on a network, even on
the Internet, with a common Web browser. Our main contributions are
recapped below:

— We have evaluated two approaches in terms of their suitability to the Web- 
based AGV simulation application. 

— A Web-based and JINI-enabled distributed AGV simulation system has been
implemented with Java and JINI technologies, providing intelligence, self-healing,
improved efficiency, high scalability, high reliability, fault tolerance, and
security measures.
Abstract. An investigation of parallel domain decomposition methods
for interactive solution of 3D boundary value problems is presented.
Several variants of the algorithms are considered: different values of subdomain
overlapping, different numbers of subdomains and processors, and
accelerated and non-accelerated two-level iterations. The dependence of
speedup on the computational parameters is discussed on the basis of
numerical experiments on the Fujitsu-Siemens RM600 multiprocessor
computer.

Keywords: parallel implementation, multiprocessor, speedup, domain 
decomposition method, boundary value problem 



1 Introduction 

We consider the numerical solution of three-dimensional grid boundary value problems
obtained by finite element, finite difference or finite volume approximations,
see [1]. The conventional approach to solving the resulting linear algebraic systems
of equations with sparse matrices of very large order consists of using
incomplete factorization with conjugate gradient methods [2], which reduce the
number of iterations but do not guarantee a scalable speedup in a direct parallel
implementation on a multiprocessor computer system. A cardinal
improvement is achieved by applying domain decomposition
[3], which is based on the simultaneous solution of subsystems in computational subdomains
and on organizing an external iteration with sequential data transfers
between neighbouring processors. Here the topological equivalence of the subdomains
and the computer net is assumed (subdomain $\Longleftrightarrow$ processor).

The goal of this paper is an experimental investigation of the efficiency of domain
decomposition methods for the solution of model three-dimensional boundary
value problems on the Fujitsu-Siemens RM 600 multiprocessor computer.
Several variants of the algorithms are considered: different values of subdomain
overlapping, different numbers of subdomains and processors, and accelerated
and non-accelerated iterations.

Section 2 describes the parallelized algorithms. Section 3 presents
the discussion of the numerical results.

* The work is supported by the RFBR grants N 99-01-00579, N 99-07-90422 
2 The Description of the Algorithms

We consider the numerical solution of a three-dimensional boundary value problem for
the diffusion equation

$$-\frac{\partial}{\partial x}\left(a\frac{\partial u}{\partial x}\right) - \frac{\partial}{\partial y}\left(b\frac{\partial u}{\partial y}\right) - \frac{\partial}{\partial z}\left(c\frac{\partial u}{\partial z}\right) = f(x,y,z), \quad (x,y,z) \in \Omega; \quad a, b, c > 0, \qquad (1)$$

in the parallelepipedal computational domain $\Omega = (x_0, x_{L+1}) \times (y_0, y_{M+1}) \times (z_0, z_{N+1})$
under the Dirichlet condition at the boundary $\Gamma$

$$u|_{\Gamma} = g(x,y,z). \qquad (2)$$

By means of simple finite difference, finite volume (see [1]) or finite element
approximations on the regular grid

$$x_i = x_{i-1} + h^x_i, \quad y_j = y_{j-1} + h^y_j, \quad z_k = z_{k-1} + h^z_k,$$
$$i = 1, 2, \ldots, L+1, \quad j = 1, 2, \ldots, M+1, \quad k = 1, 2, \ldots, N+1, \qquad (3)$$

we obtain the algebraic grid system of linear 7-point equations, see [1]:

$$(Av)_{i,j,k} = (p_0 v_0)_{i,j,k} - \sum_{l=1}^{6} (p_l v_l)_{i,j,k} = f^h_{i,j,k}, \qquad (4)$$
$$i = 1, \ldots, L, \quad j = 1, \ldots, M, \quad k = 1, \ldots, N,$$

where the local indices 0, 1, ..., 6 correspond to the central $(i,j,k)$-th node of the grid
stencil and its neighbours with multi-indices $(i-1,j,k)$, $(i,j-1,k)$, $(i+1,j,k)$,
$(i,j+1,k)$, $(i,j,k-1)$, $(i,j,k+1)$ respectively. The right-hand sides $f^h_{i,j,k}$ take
into account the boundary condition (2), and the matrix $A$ is a seven-diagonal symmetric
positive definite matrix. The coefficients $(p_l)_{i,j,k}$ before the terms $(v_l)_{i,j,k}$
that correspond to boundary nodes are supposed to be zero, and the corresponding
boundary values are included in $f^h_{i,j,k}$.

The main part of the computational cost of solving a multidimensional
boundary value problem lies in the numerical solution of very large linear algebraic
systems of equations with sparse structured matrices. One of the most
efficient approaches for scalable parallel implementation is the domain decomposition
method, which in its simplest form can be described as follows.

Let us have a rectangular computer net with $p \cdot q$ processors. The mapping
of the algorithms onto the computer architecture consists in the definition of $p \cdot q$
corresponding subdomains. The grid computational domain is presented as
the union $\Omega = \bigcup_{l=1}^{p} \bigcup_{m=1}^{q} \Omega_{l,m}$, where each parallelepipedal subdomain is

$$\Omega_{l,m} = \{(i,j,k):\ i'_l \le i \le i''_l,\ j'_m \le j \le j''_m\}$$

for all indices $k = 0, 1, \ldots, N+1$. The differences
$\Delta_i = i''_{l-1} - i'_l \ge -1$ and $\Delta_j = j''_{m-1} - j'_m \ge -1$ are integer measures of overlapping.

For simplicity we consider the uniform domain decomposition, i.e. each $(l,m)$-th
subdomain has the same number of nodes $K_{l,m} = (i''_l - i'_l)(j''_m - j'_m)(N+2)$
and the same overlapping sizes $\Delta_i$, $\Delta_j$. We define the measure of overlapping

$$\delta = lm\,K_{l,m}/(LMN), \qquad (5)$$

which characterizes the redundant computations in DDM.

The simplest version of the iterative domain decomposition method, i.e. the block
Jacobi algorithm, can be written as

$$A_{l,m}\,\tilde v^{\,n}_{l,m} = f_{l,m} - \sum_{k=1}^{4} A^{(k)}_{l,m}\, v^{n-1}_{(k)}, \qquad (6)$$

where $n$ is the number of the (external) iteration, $A_{l,m}$ and $v_{l,m}$ are the square
matrix and the subvector whose orders equal the number $K_{l,m}$ of nodes in the subdomain
$\Omega_{l,m}$, and $v_{(k)}$ denotes the subvector of the $k$-th neighbouring subdomain.
The remaining four matrices $A^{(k)}_{l,m}$ are responsible for the connections with
the neighbouring subdomains. Let $S_{l,m}$ be the total number of nonzero entries in
the sum of the matrices $A^{(1)}_{l,m} + \ldots + A^{(4)}_{l,m}$, which characterizes the
communication time at each external iteration.

The vector $\tilde v^{\,n}_{l,m}$ means the preliminary new iterative value, modified by the
formula

$$v^{n}_{l,m} = v^{n-1}_{l,m} + \tau_n\left(\tilde v^{\,n}_{l,m} - v^{n-1}_{l,m}\right) \qquad (7)$$

of Chebyshev acceleration with parameters $\tau_n$ or by the conjugate gradient approach,
see [2]. The stopping criterion for iterations (6), (7) is the tolerance condition for
the residual

$$\|r^n\| / \|r^0\| \le \varepsilon_e, \qquad r^n = f^h - A v^n. \qquad (8)$$

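Each system in the $(l,m)$-th subdomain is solved by means of some internal
iterative process

$$B_{l,m}\left(\tilde v^{\,n,t}_{l,m} - \tilde v^{\,n,t-1}_{l,m}\right) = g^{n-1}_{l,m} - A_{l,m}\,\tilde v^{\,n,t-1}_{l,m}, \qquad \tilde v^{\,n,0}_{l,m} = v^{n-1}_{l,m}, \qquad (9)$$

where $g^{n-1}_{l,m}$ is the right-hand side of (6), $B_{l,m}$ is the corresponding preconditioning
matrix, and some acceleration procedure similar to (7) is applied to (9).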

The internal iterative process continues up to a given number of iterations
$n_i$, or it is interrupted by its own tolerance criterion

$$\|r^{n,t}_{l,m}\| / \|r^{n,0}_{l,m}\| \le \varepsilon_i, \qquad r^{n,t}_{l,m} = g^{n-1}_{l,m} - A_{l,m}\,\tilde v^{\,n,t}_{l,m}. \qquad (10)$$

In general, under condition (10) the numbers of internal iterations $n_i$ can be
different for different subdomains, but we mean their maximum value in this case.
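
The two-level structure (external block Jacobi iterations, each with a fixed number of internal Jacobi sweeps per subdomain) can be sketched in pure Python. This is only an illustration under simplifying assumptions that are not in the paper: a one-dimensional model problem with the matrix tridiag(-1, 2, -1) and zero Dirichlet values instead of the 3D system (4), no acceleration, and a simple rule where each subdomain keeps only the values on its non-overlapping core. The function names and arguments are hypothetical.

```python
def residual_norm(v, rhs):
    """Euclidean norm of rhs - A v for A = tridiag(-1, 2, -1).

    v[0] and v[-1] hold the (zero) Dirichlet boundary values."""
    n = len(v) - 2
    return sum((rhs[i] - (2.0 * v[i] - v[i - 1] - v[i + 1])) ** 2
               for i in range(1, n + 1)) ** 0.5


def two_level_jacobi(rhs, p, overlap, n_inner, eps, max_outer=100000):
    """External block Jacobi over p subdomains with n_inner internal sweeps.

    rhs has n + 2 entries; rhs[0] and rhs[n + 1] are unused padding.
    Returns the approximate solution and the number of external iterations."""
    n = len(rhs) - 2
    v = [0.0] * (n + 2)
    r0 = residual_norm(v, rhs)
    size = n // p
    # Non-overlapping cores; the last core absorbs the remainder.
    cores = [(1 + s * size, n if s == p - 1 else (s + 1) * size)
             for s in range(p)]
    for outer in range(1, max_outer + 1):
        new_v = v[:]
        for lo, hi in cores:
            # Extended subdomain: the core plus `overlap` nodes on each side.
            a, b = max(1, lo - overlap), min(n, hi + overlap)
            w = v[:]  # values outside [a, b] stay frozen at the old iterate
            for _ in range(n_inner):
                prev = w[:]
                for i in range(a, b + 1):
                    w[i] = 0.5 * (rhs[i] + prev[i - 1] + prev[i + 1])
            new_v[lo:hi + 1] = w[lo:hi + 1]  # keep only the core values
        v = new_v
        if residual_norm(v, rhs) <= eps * r0:  # external criterion as in (8)
            return v, outer
    return v, max_outer
```

In a real parallel code each pass of the loop over `cores` would run on its own processor, and only the frozen interface values would be exchanged between external iterations.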

The total number of arithmetic operations in such a two-level method is defined
as

$$Q_{l,m} = \sum_{k=1}^{n_e} \left(n_i^{(k)} q_i + q_e\right), \qquad (11)$$

where $n_e$ is the number of external iterations, $n_i^{(k)}$ is the number of internal
iterations at the $k$-th external iteration, $q_i$ is the number of operations
at each internal iteration, and $q_e$ denotes an additional volume of operations at one
external iteration (in one subdomain); both are proportional to $K_{l,m}$.
3 The Analysis of Efficiency

If $\tau_a$ and $\tau_c$ are the average times of one arithmetic operation and of the transfer of one
value between processors, then the total time of the numerical
solution of system (4) equals



This formula has an evident consequence regarding the influence
of the grid subdomain shape on the relative contribution of communication losses.
Namely, because the number of arithmetic operations in (12) is proportional
to , $\gamma > 1$, and the communication time is proportional to $l + m$, the optimal

situation is $l = m$, when all subdomains have the same square cross-sections.

For a fixed total number $LMN$ of mesh points and given overlapping measures
$\Delta_i$, $\Delta_j$, increasing the number $lm$ of subdomains and processors increases
the number of external iterations and decreases the number of internal ones.

For fixed $LMN$ and $lm$, increasing the overlapping values decreases the
number of external iterations $n_e$ but increases the number of internal iterations
$n_i$, the communication time and the redundancy coefficient $\delta$ in (5). So the problem
of optimizing the DD algorithm is rather complicated and is difficult to investigate
theoretically, because various tactics of choosing the tolerance criteria for
internal iterations can be used, and the estimation of condition numbers for variably
preconditioned iterative processes is an open question in matrix theory.

The analysis of the speedup factor $R_{l,m}$ and the coefficient of efficiency

$E_{l,m} = R_{l,m}/(lm)$ can differ depending on the definition of the reference time $T_1$.
The more direct definition of $T_1$ is the execution time of the same DD method on one
processor. But in fact DDM is not the best algorithm for sequential implementation,
and the estimate of $R_{l,m}$ becomes more pessimistic if we use, for example, an
algebraic multigrid method for the solution of (4), which is poorly parallelizable but
has an almost optimal order of computational efficiency.
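
As a small illustration, the two quantities can be computed as follows, assuming the usual definition of speedup as the ratio of the one-processor time to the multiprocessor time (this ratio is an assumption here; the defining formula is not legible in the text). The function and argument names are hypothetical.

```python
def speedup(t_one, t_multi):
    # R: one-processor time T_1 divided by the time on several processors.
    return t_one / t_multi


def efficiency(t_one, t_multi, n_proc):
    # E = R / (number of processors); values above 1 mean super-linear speedup.
    return speedup(t_one, t_multi) / n_proc
```

Choosing a faster sequential reference algorithm lowers `t_one` and therefore both values, which is the more pessimistic estimate mentioned above.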

Another motivation for the multiprocessor implementation of DDM is
to pool the computational resources of several processors for the solution
of big problems that cannot be solved efficiently on one processor (for
example, because the problem exceeds the resources of a single CPU).

A dramatically high speedup can be obtained thanks to fast
cache memory: if its size is not sufficient for running the problem on one
processor but it can be used efficiently in the multiprocessor regime, then
super-linear speedup, i.e. $R_{l,m} > lm$ and $E_{l,m} > 1$, can typically be achieved.

We consider the results of numerical experiments for the model Dirichlet
boundary value problem (1), (2) with constant coefficients $a = b = c = 1$.

The solutions were sought on square grids for different $L$, $M$, $N$ and
“line” processor nets with $m = 1$ and $l = 1, 2, 4, 8$. The simplest Jacobi iterative
method was used for the external iterations with sy = 0.5 and for internal
iterations with the fixed values $n_i = 5, 10, 20$. Tables 1-4 present the numbers of
external iterations $n_e$ and the CPU times $T_p$. The computations were
made on the Fujitsu-Siemens RM 600 E30 SMP system containing 4.2 GByte
Table 1. Numerical results for DDM without overlapping,
grids $L \cdot 64 \cdot 64$, $n_i = 5, 10, 20$

p    L      n_e               T_p (sec)
1    64     223, 111, 55      124, 120, 123
2    128    288, 149, 81      161, 163, 179
4    256    289, 150, 81      165, 173, 183

Table 2. The results for DDM without overlapping,
grids $L \cdot 150 \cdot 150$, $n_i = 5, 10, 20$

p    L      n_e               T_p (sec)
1    50     272, 136, 68      637, 635, 635
2    100    858, 440, 232     2085, 2104, 2274
4    200    1153, 592, 312    3084, 3053, 3266

Table 3. Numerical results for DDM with overlapping $\Delta_i = 10$,
grids $L \cdot 64 \cdot 64$, $n_i = 5, 10, 20$

p    L      n_e                      T_p (sec)
1    64     167, [illegible], 41     94, 91, 93
2    128    216, 111, 60             122, 123, 135
4    256    216, 112, 60             125, 131, 138

Table 4. Numerical results for DDM with overlapping $\Delta_i = 10$,
grids $L \cdot 150 \cdot 150$, $n_i = 5, 10, 20$

p    L      n_e               T_p (sec)
1    50     204, 102, 51      481, 479, 480
2    100    643, 330, 174     1574, 1589, 1717
4    200    864, 444, 234     2328, 2305, 2466

of shared memory and 8 R10000 processors running at a frequency of 250 MHz.
The system software includes the Reliant UNIX operating system, C/C++ and Fortran
compilers and MPICH 1.2.0 software.

The presented results lead to the following conclusions.

a. Simultaneously increasing the grid number $L$ by a factor $p$ and the number
of processors increases the number of external iterations and the CPU time
proportionally to $p^{\alpha}$, $\alpha \approx 1/2$. This provides a good enough speedup and efficiency
of the parallelized domain decomposition method.

b. Improving the general convergence rate of the iteration process demands
accelerating the external iterations, implementing a faster internal
solver instead of the simplest Jacobi algorithm, and dynamic control of
the two-level iterations. This can be done on the basis of advanced incomplete factorization
methods and the application of preconditioned conjugate gradient methods
[2]. The problem of optimizing parallel DDM needs additional theoretical
as well as experimental investigations.

c. A further increase of speedup can be achieved by a more careful
mapping of the algorithm structure onto the computer architecture on the basis of
simultaneous implementation of data multicommunications and arithmetic operations
in fast cache, distributed or shared memory. The considered approaches
can be used efficiently for more general boundary value problems, including nonlinear
or nonstationary differential equations instead of (1). This should increase
the total computational complexity of the algorithm but not decrease the speedup
of the parallel implementation.
Abstract. Percolation is the process that causes a solvent (e.g. water)
to pass through a permeable substance and to extract a soluble
constituent. Cellular Automata provide a very powerful tool for the sim- 
ulation and the analysis of percolation processes. In some cases, however, 
the most challenging problem is perhaps to reproduce correctly within 
the automaton the features of the percolation bed, that is, the porous 
medium the solvent flows through. In this paper we present a compu- 
tational model for the controlled generation of two-dimensional perco- 
lation beds based on stochastic Cellular Automata, and we show how 
it has been applied to the generation of percolation beds suitable for 
the simulation of pesticide percolation in the soil. In particular, the approach
we present makes it possible to control the shape and the size
of the single components of the bed (e.g. grains), and their position. In
order to reproduce percolation beds of feasible size, and to manage large 
automata, the model has been implemented on a cluster of workstations. 



1 Introduction 

Percolation is the process that causes a solvent (for example, water) to pass
through a permeable substance, and to extract a soluble constituent. Percolation
processes have been thoroughly studied both from theoretical [1] and applicative
[2] viewpoints. Percolation theory defines and offers formal frameworks for
the creation of abstract, analytical, and computational models dealing with a
wide range of phenomena.

The computer simulation of percolation phenomena occurring in porous media
is a very challenging problem, for which several computational models and
techniques have been introduced, from finite element algorithms to Cellular
Automata (CA) [3]. The latter provide a very powerful tool for the simulation of
percolation processes, for instance, in the case of coffee [4], pesticides [5] and carbon
black rubber compounds [6]. In these models, some cells of the automaton reproduce
the percolation bed, that is, the porous medium the solvent flows through,
while others are empty or contain the solvent. Reproducing in the automaton
the features of real percolation beds, however, can be a very challenging task.
When available, microscope images of real case studies can be used directly or as
a model to be reproduced, both for the shape and the size of the single components
(for example, grains of coffee), and their position within the bed. In some
cases, however, experimental data are hard or expensive to obtain. Therefore, a
computer-based method for the generation of percolation beds reproducing as
much as possible the features of real cases could be a very useful tool. In particular,
in this paper we focus on the generation of percolation beds reproducing
soil, which have been used for the simulation of pesticide percolation.

Fig. 1. The von Neumann neighborhood of cell $C(i,j)$.

2 The Model 

Informally, CA can be viewed as parallel computing machines made of a large
number of processors, called cells, that perform simple operations, as opposed to
traditional sequential computers based on a single processing unit that executes
complex operations. Cells are usually arranged on a regular grid. CA evolve
through a sequence of discrete time steps. At a given time, every cell is characterized
by a state, belonging to a finite set. The states of the cells are updated
simultaneously at each step according to a given update rule. The rule determines
the new state of each cell according to the current state of the cell itself
and the states of the neighboring cells, located on adjacent nodes of the grid. In
our model cells are arranged on a two-dimensional $N \times M$ square grid, and we
adopted the von Neumann neighborhood, where every cell has four neighbors.
More in detail, the neighbors of the cell located in position $(i,j)$ in the grid are
the cells in positions $(i-1,j)$, $(i+1,j)$, $(i,j-1)$, and $(i,j+1)$ (see Fig. 1).
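
A minimal Python sketch of this neighborhood follows. The text does not say how cells on the border of the grid are treated, so dropping out-of-grid positions is an assumption made here, and the function name is hypothetical.

```python
def von_neumann_neighbors(i, j, n_rows, n_cols):
    # The four candidate positions named in the text.
    candidates = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    # Assumption: positions outside the n_rows x n_cols grid are discarded.
    return [(r, c) for r, c in candidates
            if 0 <= r < n_rows and 0 <= c < n_cols]
```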

In this paper, we present a computational model based on Stochastic Cellular
Automata (that is, CA whose update rule is probabilistic) for the controlled
generation of two-dimensional patterns, which makes it possible to control the
morphological properties (shape and size) of the patterns produced. The cells of
the automaton can assume two states: empty or occupied. Patterns are formed
by adjacent occupied cells (components), and are dynamically generated starting
from a single occupied cell that assumes the role of an initial seed, while the
remaining cells of the automaton are empty. When the automaton is started,
the seed starts growing, with a process similar to cell mitosis, and generates new
components in adjacent empty cells. The reproductive potential of the seed is
expressed as an integer number of reproductive abilities, which determines the
final size of the pattern. In fact, each reproduction uses one reproductive ability;
therefore, the maximum number of components that can be generated equals
the number of reproductive abilities assigned to the initial seed. A newly generated
component takes some reproductive abilities from its parent component,
becoming in this way able to produce offspring components by itself. Once a cell
has become occupied by a component, it cannot revert to the empty state.

The choice of the neighboring cells that will contain a new component, that
is, the direction of growth of the pattern, is controlled by a probabilistic mechanism
that can be fine-tuned in order to obtain patterns with different shapes.
That is, a probability value is associated with each of the four possible growth
directions. Thus, each evolution of the CA starting with the same probability
distribution produces (with high probability) a different pattern. Nevertheless, all
patterns generated with the same probability distribution have the same macroscopic
properties, that is, they look similar to the human eye.

The automaton can be initialized with more than one seed. In this case, each 
pattern is characterized by a different id number. A cell can belong to at most one 
pattern, and all the cells belonging to the same pattern have the same id number. 
A different number of reproductive abilities, and different growth probabilities 
can be assigned to each seed, in order to obtain patterns of different size and 
shape within the same automaton. Whenever an empty cell has neighbors with 
different id numbers trying to generate a new component in its position, its 
occupation is determined by competition among candidate parents. 

In the following section we present the formal description of the model for the 
generation of a single pattern. Then, we show how the model can be extended to 
more than one pattern. Finally, we show how the model has been applied to the 
generation of artificial percolation beds for the simulation of pesticide leaching 
in the soil. 

3 Generating a Single Pattern 

We now give the formal definition of the automaton. The CA is defined by a
5-tuple $CA = (R, H, Q, f, I)$, where:

1. $R = \{(i,j) \mid 0 \le i \le N-1;\ 0 \le j \le M-1;\ N, M \in \mathbb{N}\}$ is a two-dimensional
$N \times M$ lattice;

2. $H$ is the von Neumann neighborhood;

3. $Q = \{ID, CAP, P\}$ is the finite set of the values of the state variables, where:

(a) $ID$ is the cell identifier, 0 in case of an empty cell, 1 otherwise;

(b) $CAP = (N, S, W, E)$ are integers representing the number of reproductive
abilities towards north, south, west, and east, respectively;

(c) $P = (P_N, P_S, P_W, P_E)$ are the probabilities of directing the reproductive
abilities towards north, south, west and east, that is, the probabilities
associated with the four possible directions of growth, such that
$\sum_{i \in \{N,S,W,E\}} P_i = 1$.

4. $f : Q \times Q^{|H|} \to Q$ is the state transition function (update rule);

5. $I : R \to Q$ is the initialization function.
Fig. 2. The structure of a cell of the automaton.



The structure of a cell of the automaton is shown in Fig. 2. From now on,
we will refer with $C(i,j)$ to the cell located at coordinates $(i,j)$, with $ID(i,j)$
to its identifier, with $N(i,j)$, $S(i,j)$, $W(i,j)$, and $E(i,j)$ to the number of
reproductive abilities contained in each of its four portions, and with $P(i,j) =
(P_N(i,j), P_S(i,j), P_W(i,j), P_E(i,j))$ to the probability values associated with
the four directions.

The initialization function $I$ picks a cell $(k,l)$, usually located near the center
of the automaton, sets its $ID$ to one (it becomes occupied), and sets the other
parameters, corresponding to the reproductive abilities and the probabilities, as
defined by the user. Every other cell $(i,j)$ of the automaton has $ID$ set to zero, no
reproductive abilities, and probability values $P(i,j) = P(k,l)$.
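
The cell state and the initialization step can be sketched in Python as below. The `Cell` class, its field names and the `initialize` signature are hypothetical; only the behaviour (one occupied seed with user-defined abilities, all other cells empty with the same probabilities) follows the description above.

```python
from dataclasses import dataclass, field


def _no_abilities():
    return {"N": 0, "S": 0, "W": 0, "E": 0}


def _uniform():
    return {"N": 0.25, "S": 0.25, "W": 0.25, "E": 0.25}


@dataclass
class Cell:
    ident: int = 0                                    # ID: 0 empty, 1 occupied
    cap: dict = field(default_factory=_no_abilities)  # CAP = (N, S, W, E)
    prob: dict = field(default_factory=_uniform)      # P = (P_N, P_S, P_W, P_E)


def initialize(n_rows, n_cols, seed, abilities, prob):
    # Every cell starts empty, without abilities, with the seed's probabilities.
    grid = [[Cell(prob=dict(prob)) for _ in range(n_cols)]
            for _ in range(n_rows)]
    k, l = seed
    grid[k][l].ident = 1                  # the seed cell becomes occupied
    grid[k][l].cap = dict(abilities)      # user-defined reproductive abilities
    return grid
```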

The transition function $f$ is the composition of two functions, that is,
$f = h \circ g$, where $g : Q \times Q^{|H|} \to Q$ is defined as follows:

$g(C(i,j), S(i-1,j), N(i+1,j), W(i,j+1), E(i,j-1)) = C'(i,j)$

with $C'(i,j) = (ID'(i,j), CAP'(i,j), P(i,j))$, where:

$ID'(i,j) = ID(k,l)$

If $ID(i,j) = 0$, then $(k,l)$ are the coordinates of the cell corresponding to
the maximum of $\{N(i+1,j), S(i-1,j), W(i,j+1), E(i,j-1)\}$ (the highest
bidder among the neighbors). Otherwise, $(k,l) = (i,j)$. That is, if an empty
cell has at least one neighbor with $ID$ set to one that is able to reproduce in its
direction, it becomes occupied. Otherwise, the $ID$ does not change, and remains
one or zero. An occupied cell (newly or not) collects reproductive abilities from
occupied neighbors. That is:

$$N'(i,j) = \begin{cases} S(i-1,j) & \text{if } ID'(i,j) = ID(i-1,j) = 1 \\ N(i,j) & \text{otherwise} \end{cases}$$

and similarly for $S'(i,j)$, $W'(i,j)$, and $E'(i,j)$.
If the $ID$ of the cell at the previous step was zero and the cell becomes occupied
as an effect of function $g$, then the number of reproductive abilities inherited
from the parent neighbor (at least one) is decreased by one, which is used by cell $(i,j)$
itself to become occupied. Also, $P'(i,j) = P(i,j)$, that is, probabilities are left
unchanged by $g$. The function $h : Q \to Q$ is defined as follows:

$h((ID(i,j), CAP(i,j), P(i,j))) = (ID'(i,j), CAP'(i,j), P(i,j))$
where $ID'(i,j) = ID(i,j)$ (the identifier does not change). The actual change
takes place in the distribution among the four portions of the reproductive abilities
collected at the previous step (if any). That is, given $T_N = N(i,j) + S(i,j) +
W(i,j) + E(i,j)$, we have:

$$N'(i,j) = \sum_{k=1}^{T_N} r_k^N$$

where $r_k^N$ are $T_N$ random variables such that $\forall r_k^N$, $\Pr[r_k^N = 1] = P_N(i,j)$. Then,
setting $T_S = T_N - N'(i,j)$, we have:

$$S'(i,j) = \sum_{k=1}^{T_S} r_k^S$$

where $\forall r_k^S$, $\Pr[r_k^S = 1] = P_S(i,j)$. In the same way, given $T_W = T_S - S'(i,j)$,

$$W'(i,j) = \sum_{k=1}^{T_W} r_k^W$$

with $\forall r_k^W$, $\Pr[r_k^W = 1] = P_W(i,j)$. Finally, given $T_E = T_W - W'(i,j)$,

$$E'(i,j) = T_E$$

Fig. 3. Growth of a seed with 5000 reproductive abilities, P = 1/4 in each direction,
at steps 5, 10, 50, 100, 200, and 500.

Also, $h$ might change the four probabilities $P(i,j)$. One possible way could be
to rotate the probabilities, for example, by setting $P_N(i,j) = P_E(i,j)$, $P_W(i,j) =
P_N(i,j)$, and so on. The final shape of the patterns generated by the automaton is
determined by the initial probabilities assigned to the starting seed and how they 
are changed during the evolution. For example, if the probabilities are initially 
set to 1/4 in each direction and left unchanged by function h, the result is the 
uniform growth of the pattern in the four directions, with a final shape that 
looks circular, as shown in Fig. 3. Different simulations would lead (with high 
probability) to a slightly different result, that anyway would still look circular 
to the human eye. 
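
The redistribution performed by $h$ can be sketched in Python as follows. The sketch follows the sequential draws described above (north first, then south and west, with the remainder going east); the dictionary representation of $CAP$ and $P$, the function name `distribute` and the `rng` argument are assumptions made for illustration.

```python
import random


def distribute(cap, prob, rng=random):
    # T_N: all abilities currently held by the cell.
    remaining = cap["N"] + cap["S"] + cap["W"] + cap["E"]
    new_cap = {}
    for direction in ("N", "S", "W"):
        # One Bernoulli trial per remaining ability, success probability P_d.
        drawn = sum(1 for _ in range(remaining)
                    if rng.random() < prob[direction])
        new_cap[direction] = drawn
        remaining -= drawn
    new_cap["E"] = remaining  # E' = T_E: whatever is left goes east
    return new_cap
```

The total number of abilities is preserved by construction, since every ability is assigned to exactly one portion.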
4 The Multi Pattern Model

In order to let more than a single pattern grow within the same automaton, we
have to make some changes to the model introduced in the previous section. The
user may now define different classes of patterns to generate (each one possibly
corresponding to a different growth probability distribution), the size of the
patterns belonging to each class (usually picked at random between minimum
and maximum values), and how many patterns are wanted for each class (the
exact number, or a probability value assigned to each class). The initialization
function now has to plant more than one seed; this can be simply implemented
by choosing a cell at random, checking whether it is occupied or not, and, in
the latter case, defining the initial parameters of the seed according to the user’s
input.

An occupied cell belongs to only one pattern. In order to be able to recognize
which pattern a cell belongs to, the $ID$ is now defined as an integer (greater than
zero, not greater than the number of the patterns). Each pattern is therefore
formed by cells with the same $ID$. The $ID$ is assigned to the patterns by the
initialization step, which simply associates a different $ID$ number with each seed.

Moreover, an empty cell might now have neighbors belonging to different
patterns trying simultaneously to expand in its direction, generating a new
component in its position. The transition function therefore has to be modified
in order to resolve this possible conflict. The idea is the following: the cell takes
the $ID$ of the highest bidder, that is, the neighbor that has the most reproductive
abilities in the portion adjacent to the empty cell. Ties are broken arbitrarily.
Also, occupied cells can trade abilities with neighbors with the same $ID$. The
transition sub-function $g$ therefore becomes:

$g(C(i,j), S(i-1,j), N(i+1,j), W(i,j+1), E(i,j-1)) = C'(i,j)$

with $C'(i,j) = (ID'(i,j), CAP'(i,j), P'(i,j))$, where:

$ID'(i,j) = ID(k,l)$

$P'(i,j) = P(k,l)$

If cell $(i,j)$ is empty, and at least one neighbor has reproductive abilities in the
portion adjacent to $(i,j)$, then $(k,l)$ are the coordinates of the cell corresponding
to the maximum of $\{N(i+1,j), S(i-1,j), W(i,j+1), E(i,j-1)\}$ (the highest
bidding neighbor). Otherwise, $(k,l) = (i,j)$. Also,

$$N'(i,j) = \begin{cases} S(i-1,j) & \text{if } ID'(i,j) = ID(i-1,j) \\ N(i,j) & \text{otherwise} \end{cases}$$

The terms $S'(i,j)$, $W'(i,j)$, and $E'(i,j)$ are defined in a similar way. As in the
single-pattern case, if $ID(i,j) = 0$ and $ID'(i,j) > 0$ the abilities inherited from
the parent cell are decreased by one. The transition sub-function $h$ remains
unchanged.
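
The choice of $(k,l)$ in the multi-pattern sub-function $g$ can be sketched in Python as follows. The grid representation (a matrix of identifiers plus a matrix of ability dictionaries), the function name and the tie-breaking rule (the first of the maximal bidders in a fixed scan order wins) are assumptions; the text only requires that ties be broken arbitrarily.

```python
# Offset of each neighbor and the portion of that neighbor facing cell (i, j):
# the cell below offers its north portion, the cell above its south portion,
# the cell to the right its west portion, the cell to the left its east one.
FACING = [((1, 0), "N"), ((-1, 0), "S"), ((0, 1), "W"), ((0, -1), "E")]


def highest_bidder(ident, cap, i, j):
    """Return (k, l): the cell whose ID cell (i, j) takes in sub-function g."""
    if ident[i][j] != 0:
        return (i, j)  # an occupied cell keeps its own ID
    best, best_bid = (i, j), 0
    for (di, dj), portion in FACING:
        k, l = i + di, j + dj
        inside = 0 <= k < len(ident) and 0 <= l < len(ident[0])
        if inside and ident[k][l] != 0:
            bid = cap[k][l][portion]
            if bid > best_bid:  # strict comparison: the first maximal bid wins
                best, best_bid = (k, l), bid
    return best
```

If no occupied neighbor has abilities in the facing portion, the function returns $(i,j)$ itself, so the cell stays empty.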
Table 1. Different types of soil texture and corresponding grain size.

Soil                Grain Diameter
Clays               < .002 mm
Silts               .002 – .02 mm
Sands               .02 – 2 mm
Coarse Fragments    > 2 mm


5 Generation of Soil Percolation Beds 

Pesticides have become essential to modern agriculture for obtaining production
yields sufficient to satisfy the growing needs of the increasing world population.
More than two million tons of pesticide products derived from 900 active
ingredients are used each year worldwide. The extensive use of pesticides can
entail risks for the environment and non-target organisms, including humans.
When applied to crops, pesticides are absorbed by soil. Then, when water flows
through the soil because of rain or floods, pesticides can be released into it.
Water containing pesticide may reach the groundwater layer because of gravity.
Since groundwater is usually the source of common tap water, the polluting
danger deriving from the excessive use of pesticides is easy to understand. For
the percolation beds used in the simulations of this case study, the experimental
data concern the shape and size of the grains composing the bed. Soil separates
(individual grains of soil mineral materials) can be divided into three main
particle size classes, shown in Table 1. Size fractions are generally determined
either by sieving or by Stokes settling rates. Soil is usually classified according
to the percentage of clay, silt, and sand grains it contains. Thus, percolation
beds can also be composed of different classes of grains. It has been
experimentally observed (by microscope images) that the larger the grains, the
more regular their shape. Clay grains are very irregular, while sand grains (the
largest) are more or less spherical. The percentage of the percolation bed
occupied by grains usually ranges from 40% to 60%. The position of the grains in
the bed follows a rule of thumb (based on geologists' advice): voids between
grains cannot be larger than the maximum size of the grains in the bed.

Fig. 4. Example of a silt percolation bed, generated by the multi-pattern model.

The size of grains is reproduced in our model as follows. We let the size
of the smallest grain composing the bed correspond to the size of a cell of the
automaton. Therefore, the number of abilities assigned to a seed ranges from 0
(the smallest grain possible) to R = M/m, where M and m are the maximum and
minimum grain size allowed, respectively. For example, in the case of silt
percolation beds, the largest grain can be 100 times larger than the smallest one.
Therefore, the number of abilities assigned to each seed is selected at random
between 0 and 100 according to the probability distribution given by the user. For
example, we can have uniform probability over the size range. In the case of
percolation beds composed of different grain types, we first select the seed type
at random (with probabilities proportional to the percentage of grains of each
type we want in the bed), then assign to it the reproductive abilities. The shape
of the grains can be reproduced in our model by setting the growth probabilities
to 1/4 in each direction, which, as we have shown, tends to generate circular
patterns. Also, the smaller the pattern generated, the more irregular its shape,
as in the real case.
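The seed-generation rule just described can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not the authors' code: the
names GRAIN_RATIO and sample_seed are hypothetical, the clay and sand ratios and
the 70/30 mixture are placeholder values, and only the silt ratio of 100 and the
uniform size distribution are taken from the text.

```python
import random

# Maximum number of reproductive abilities R = M / m per grain type.
# Only the silt value (100) comes from the text; the others are placeholders.
GRAIN_RATIO = {"silt": 100, "clay": 10, "sand": 100}


def sample_seed(type_fractions, rng):
    """Pick a grain type with probability proportional to its share of the bed,
    then draw the number of abilities uniformly from 0..R for that type."""
    types = list(type_fractions)
    weights = [type_fractions[t] for t in types]
    grain_type = rng.choices(types, weights=weights, k=1)[0]
    abilities = rng.randint(0, GRAIN_RATIO[grain_type])  # bounds are inclusive
    return grain_type, abilities


rng = random.Random(0)
seeds = [sample_seed({"silt": 0.7, "clay": 0.3}, rng) for _ in range(1000)]
```

A non-uniform size distribution could be obtained by replacing the randint call
with a weighted draw over the range 0..R.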

The final position of the grains in the percolation bed is influenced by the 
choice of the cells containing the initial seeds. Seeds cannot be too close, since 
in this case they would not have enough room to grow. The effect would be a 
percolation bed with some parts packed with grains and some under-populated 
regions, violating the rule determining the maximum size of empty spaces. We 
solved this problem with a simple trick: whenever we plant a seed in the
automaton and mark its cell as occupied, we also mark the surrounding cells
as "unavailable" for seeds, in order to prevent another seed from being placed
too close to the current one. The size of the forbidden area roughly corresponds
to the average
size of a grain. Even if this rule does not guarantee that too many grains are 
packed together, the experimental results have been satisfactory. An example is 
shown in Fig. 4. 
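The seed-planting rule above can be illustrated as follows. This is a minimal
sketch under assumptions that are not in the original text: the forbidden area is
taken to be a square of half-width `radius` around each seed, the grid has
non-periodic borders, and the function name place_seeds and its parameters are
hypothetical.

```python
import random


def place_seeds(rows, cols, n_seeds, radius, rng, max_tries=10000):
    """Plant up to n_seeds seeds; after each one, mark the surrounding
    (2*radius+1) x (2*radius+1) square as unavailable for further seeds."""
    available = [[True] * cols for _ in range(rows)]
    seeds = []
    tries = 0
    while len(seeds) < n_seeds and tries < max_tries:
        tries += 1
        i, j = rng.randrange(rows), rng.randrange(cols)
        if not available[i][j]:
            continue  # cell is occupied or inside a forbidden area
        seeds.append((i, j))
        for di in range(-radius, radius + 1):
            for dj in range(-radius, radius + 1):
                ii, jj = i + di, j + dj
                if 0 <= ii < rows and 0 <= jj < cols:
                    available[ii][jj] = False
    return seeds


seeds = place_seeds(rows=60, cols=60, n_seeds=20, radius=4, rng=random.Random(1))
```

As the text notes, a rule of this kind keeps seeds apart but does not by itself
bound the size of the voids that remain once the grains have grown.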

The initial particle size of the mineral fraction influences many processes of 
soil development and the properties of the resulting soil. Coarser (larger)
materials generally have high hydraulic conductivities, while finer materials have
low hydraulic conductivities, that is, they let less water (and thus fewer
pollutants) reach the groundwater layer. Thus, the sandier the soil (that is, the
larger the grains composing it), the higher the risk for water containing
pesticides to reach the groundwater layer. These results have been reproduced in
our model [7], leading us to conjecture that the morphological properties of the 
percolation beds have been captured successfully. 

6 The Parallel Implementation 

In order to generate percolation beds of feasible size and to model real conditions,
a very large automaton often has to be employed, and the update rule has to be
applied a large number of times. For example, in a silt percolation bed, each cell
represents a square portion of soil with a side of 2 μm. This makes the algorithm
time and memory consuming even for the most powerful sequential machines.
Therefore, we implemented our model on a cluster of workstations, using the 
MPI (Message Passing Interface) library. 

The two-dimensional grid forming the automaton is divided vertically into n
layers, where n is the number of processors available. Basically, each processor
updates a slice of the automaton. That is, processor p will take care of the cells
belonging to rows from (N/n)(p - 1) to (N/n)p - 1, where N is the overall number
of rows. Thus, at each update step the processors simultaneously update the cells
belonging to their part, divided into rows numbered from 1 to N/n.

Rows 1 and N/n of each layer have to be updated according (also) to the
state of cells belonging to other processors. For this reason, we added rows 0 and
(N/n) + 1 to each part. These two rows are composed of so-called ghost cells.
Before each update step, processor number p communicates the state of its cells
of row 1 to processor number p - 1, and the state of the cells of row N/n to
processor number p + 1. It also receives the state of the cells of row N/n from
processor p - 1, which form its own row 0, and the state of the cells of row 1 from
processor p + 1, which form its own row (N/n) + 1. In this way, each processor can
update its border rows by copying into its ghost cells the state of the neighboring
cells belonging to different processors. In order to parallelize the communication
among processors as well, we make processor p communicate first with p + 1, then
with p - 1 if p is even; vice versa if p is odd.

The parallel implementation of the update routine of the automaton can be 
summed up as follows, where p is the processor number: 

1. if p is odd:
   (a) send row 1 to processor p - 1;
   (b) receive row 0 from processor p - 1;
   (c) send row N/n to processor p + 1;
   (d) receive row (N/n) + 1 from processor p + 1;

2. else (p is even):
   (a) receive row (N/n) + 1 from processor p + 1;
   (b) send row N/n to processor p + 1;
   (c) receive row 0 from processor p - 1;
   (d) send row 1 to processor p - 1;

3. update cells.
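The effect of the ghost rows can be checked with a small sequential simulation.
The sketch below is an illustration only and makes several assumptions: it does
not use MPI, the slices are numbered from 0 rather than 1, N is assumed to be
divisible by n, cells outside the grid are treated as 0, and the update rule (own
state plus north and south neighbours) is a placeholder chosen only because it
depends on rows owned by other slices. The odd/even ordering of steps 1 and 2 is
not modelled, since it only matters when the exchanges are real blocking messages.

```python
def step_global(grid):
    """Reference update on the whole grid: own state + north + south neighbour."""
    n_rows = len(grid)
    zero = [0] * len(grid[0])
    out = []
    for r in range(n_rows):
        north = grid[r - 1] if r > 0 else zero
        south = grid[r + 1] if r < n_rows - 1 else zero
        out.append([a + b + c for a, b, c in zip(north, grid[r], south)])
    return out


def step_sliced(grid, n_proc):
    """Same update with the grid split into n_proc slices of h = N/n rows,
    each padded with ghost rows 0 and h + 1."""
    n_rows, n_cols = len(grid), len(grid[0])
    h = n_rows // n_proc
    zero = [0] * n_cols
    # local[p] holds rows 0 .. h+1; rows 1 .. h are owned by slice p.
    local = [[zero[:]] + [row[:] for row in grid[p * h:(p + 1) * h]] + [zero[:]]
             for p in range(n_proc)]
    # Ghost exchange: row h of slice p-1 becomes row 0 of slice p,
    # row 1 of slice p+1 becomes row h+1 of slice p.
    for p in range(n_proc):
        if p > 0:
            local[p][0] = local[p - 1][h][:]
        if p < n_proc - 1:
            local[p][h + 1] = local[p + 1][1][:]
    # Each slice now updates its own rows using only local data.
    out = []
    for p in range(n_proc):
        for r in range(1, h + 1):
            out.append([a + b + c for a, b, c in
                        zip(local[p][r - 1], local[p][r], local[p][r + 1])])
    return out


grid = [[(3 * r + c) % 5 for c in range(6)] for r in range(8)]
same = step_sliced(grid, 4) == step_global(grid)
```

In a message-passing version, the two assignments in the exchange loop would
become the paired send and receive operations listed in steps 1 and 2.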

7 Conclusions 

Among the problems arising in the design and development of computer
environments for the simulation of percolation in porous media, the correct
representation of percolation beds is a crucial issue. The model we presented,
which controls both the properties of single components and their overall
position, can provide an efficient tool for this task, as shown in this work for
the case of soil percolation beds.







S. Bandini, G. Mauri, and G. Pavesi 



References 

1. D. Stauffer, A. Aharony. Introduction to Percolation Theory. Taylor & Francis,
   London, 1992.

2. M. Sahimi. Applications of Percolation Theory. Taylor & Francis, London, 1994.

3. M. Sahimi (ed.). Flow Phenomena in Rocks: from Continuum Models to Fractals,
   Percolation, Cellular Automata and Simulated Annealing. Rev. of Modern
   Physics, 65(4), 1993.

4. G. Borsani, G. Cattaneo, V. de Mattel, U. Jocher, B. Zampini. 2D and 3D Lattice
   Gas Techniques of Fluid-Dynamic Simulations. In S. Bandini, R. Serra, F. Suggi
   Liverani (eds.). Cellular Automata: Research Towards Industry, Springer Verlag,
   Berlin, 1998.

5. S. Bandini, G. Mauri, G. Pavesi, G. Simone. A Parallel Model Based on Cellular
   Automata for the Simulation of Pesticide Percolation in the Soil. In V. Malyshkin
   (ed.). Parallel Computing and Technologies, Lecture Notes in Computer Science
   1662, Springer Verlag, Berlin, 1999.

6. S. Bandini, M. Magagnini. Parallel Simulation of Dynamic Properties of Filled
   Rubber Compounds Based on Cellular Automata. Parallel Computing, 27(5),
   643-661, 2001.

7. S. Bandini, G. Mauri, G. Pavesi, G. Simone. Parallel Simulation of Reaction
   Diffusion Phenomena in Percolation Processes: a Model Based on Cellular
   Automata. Future Generation Computer Systems, 17(6), 679-688, 2001.




Parallel Simulation of 3D Incompressible Flows 
and Performance Comparison 
for Several MPP and Cluster Platforms 



Oleg Bessonov^1, Dominique Fougere^2, and Bernard Roux^2

^1 Institute for Problems in Mechanics of Russian Academy of Sciences,
101, Vernadsky ave., 117526 Moscow, Russia
^2 Laboratoire de Modelisation en Mecanique a Marseille, L3M-IMT, La Jetee,
Technopole de Chateau-Gombert, 13451 Marseille Cedex 20, France
bess@ipmnet.ru, {fougere,broux}@l3m.univ-mrs.fr



Abstract. This paper describes a parallelization method for the numer- 
ical simulation of 3D incompressible viscous flow in a cylindrical domain. 
Implementation details are discussed for efficient parallelization on dis- 
tributed memory computers with relatively slow communication links. 
The developed parallel code is used for the performance evaluation of sev- 
eral computers of different architectures, with the number of processors 
used from 1 to 16. The obtained results are compared to the measured 
computational and communication characteristics of these computers. 



1 Introduction 

Modern distributed memory parallel computers are characterized by very high 
computational potential. Therefore they are very attractive for the solution of 
time-consuming non-steady 3D CFD problems. In contrast, the communication 
speed of interconnection networks is usually much lower than necessary to ex- 
ploit fully the intrinsic parallelism of numerical algorithms. With the rapid de- 
velopment of superscalar RISC microprocessors, the gap between computational 
speed and interconnection capacity becomes even wider. Therefore, much attention
should be paid to the development of numerical methods and parallelization
algorithms that are economical from the point of view of data exchanges.

For simulations of flows in 3D regular domains (rectangular or cylindrical), 
the Finite Difference (FDM) and Finite Volume (FVM) methods have proved to 
be very efficient [1]. Straightforward implementations of these methods normally
use a substantial fraction of "explicit" time integration codes, which don't need
data exchanges during the computational steps. Only a small part of the data, the
separator (boundary) planes between subdomains belonging to different
computational nodes, needs to be transferred after completion of every timestep.

Unfortunately, a realistic simulation of incompressible viscous flows can't be
performed by purely explicit code due to timestep constraints, especially for flows
with highly diffusive processes (e.g. low Prandtl melt flow in crystal growth
applications). Implicit methods have to be incorporated, which involve solving
3-diagonal linear systems in every spatial direction (for the economical ADI
approach). Another numerical difficulty of incompressible flow simulation arises
from the physical nature of pressure. The pressure Poisson equation must be
solved globally in the entire domain on every timestep. In order to avoid
expensive iterative methods, the direct Fourier method is often used, which
involves Fast Fourier transform (FFT) steps and 3-diagonal sweeps. Parallelization
of FFT requires full data exchange between nodes and is therefore very
uneconomical. In order to reduce the amount of data exchanges, several approaches
have been suggested by different authors [2,3]. However, these algorithms are
either much less accurate than necessary, or less economical than the Fourier
method (they would need, for example, O(N^2) operations vs. O(N log(N)) for a 1D
transform).

The present work is based on the previous effort on parallelization of a 3D
CFD problem [4], where one-dimensional decomposition of a computational domain
was considered. Now, the analysis has been extended to multidimensional
decomposition, with consideration of all the questions that arise. To avoid
excessive data exchanges, a new method for solving the Poisson equation has been
developed, based on a cyclic reduction of the arising linear systems within the
framework of the FACR approach [5]. As a result, an algorithmically and
numerically economical implementation has been obtained for up to 16 processors.

Another part of this paper is devoted to the comparative analysis of perfor- 
mance and parallelization efficiency for different distributed memory machines - 
massive parallel computers (MPP) and SMP clusters, using this new code as a 
benchmark. Some previous work has been performed in this area [6]. The current 
analysis is based on the evaluation of parallelization efficiency of the presented 
code for different number of processors (2, 4, 8, 16) and problem sizes in com- 
parison with the measured computational and communication characteristics. 

2 Description of the Numerical Method 

The numerical problem considered here is the solution of 3D non-stationary 
Navier-Stokes equations in Boussinesq approximation for incompressible viscous 
flow in a cylindrical domain. This sort of simulation is used in crystal growth 
applications, like semiconductor melt flows in Czochralski apparatus [7]. 

The velocity-pressure formulation and FVM discretization are employed, 
with the decoupled solution of momentum, pressure and temperature equations 
using the Fractional step (pressure correction) method. The time integration 
scheme is partially implicit, with the implicit treatment of the most critical 
terms using ADI (Alternating directions implicit) approach. The pressure Pois- 
son equation is normally solved by efficient Fourier method, that involves FFTs 
in two spatial directions and 3-diagonal systems solutions in the last direction. 
This numerical method is fully direct and doesn’t involve costly iterative steps. 

From the point of view of data processing, the computations are organized
in the following way:

- The cylindrical computational domain is considered as a 3-dimensional array
  (φ, z, r). All computations are performed in the most efficient manner, using
  the 1st index of the array as the innermost one in Fortran loops. An iteration
  of the outer loop can be considered as computations in a plane (φ, z) that
  is moving in the direction r as a "frontal plane of computations" [4].

- All explicit parts of the algorithm are trivial and simply form 2D loops within
  this plane of computations.

- The implicit part is split into solving 3-diagonal linear systems in all 3
  directions, each consisting of 2 sweeps (forward and backward) in the
  corresponding direction. All sweeps in the directions φ and z involve processing
  of data located within the 2D plane of computations. Sweeps in the direction r
  look like a slow motion of this plane in the forward or backward direction.

- The Fourier method comprises FFTs in the directions φ and z, which again
  involve processing within a plane of computations, and solving 3-diagonal
  systems in the direction r, implemented as for the implicit step.

Fig. 1. 1D and 2D decompositions of a computational domain

3 Parallelization of the Algorithm 

The parallelization method is based on splitting the computational domain in
the last 2 directions, r and z. The current implementation includes the following
variants: 1 x 1, 2 x 1, 4 x 1, 4 x 2 and 4 x 4 (Fig. 1), from 1 to 16 CPUs (with a
possible extension to 8 x 4 for 32 CPUs).

Consider first the parallelization method for 1-dimensional splitting.

- Computational domains are overlapped, with one neighbour's plane (2D array
  of data) stored in a node for each boundary. This is necessary for the
  calculation of some terms in the discretized equations.

- All parts of the numerical algorithm involving calculations only within a
  plane of computations (φ, z) are processed independently in each node and
  don't need data exchanges. These parts include all explicit steps, implicit
  sweeps in the directions φ and z, and FFTs in these directions. Data
  exchanges between adjacent processor nodes are performed only between these
  steps (when necessary), transmitting full 2D arrays of data.

- Sweeps in the direction r can't be parallelized within the conventional
  3-diagonal solver. Instead, twisted factorization is used for 2 processors,
  or the two-way parallel partition method [8,4] for 4 or more processors. These
  methods employ a more complicated form of the Gauss elimination procedure,
  which can be carried out simultaneously in all subdomains. These modified
  sweeps are performed as frontal planes of computations, with exchange of full
  2D data arrays between adjacent nodes when necessary. Parallel solution of a
  3-diagonal system on 4 processors requires 3 to 6 such exchanges (depending on
  the sort of 3-diagonal matrix).

Fig. 2. Blocked transposition for parallelization of FFT in the direction z
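The idea behind twisted factorization for 2 processors can be sketched as
follows. This is a sequential illustration of the general technique for a single
scalar 3-diagonal system, not the code of the paper: the function name
twisted_solve, the choice of the middle row m = n // 2, and the comments assigning
the two halves to "processor 0" and "processor 1" are assumptions, and no pivoting
is done, so a well-conditioned (e.g. diagonally dominant) matrix is assumed.

```python
def twisted_solve(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] for i = 0..n-1
    (a[0] and c[n-1] are ignored) by eliminating from both ends towards row m."""
    n = len(b)
    assert n >= 3
    m = n // 2
    cp, dp = [0.0] * n, [0.0] * n   # top part:    x[i] = dp[i] - cp[i] * x[i+1]
    aq, dq = [0.0] * n, [0.0] * n   # bottom part: x[i] = dq[i] - aq[i] * x[i-1]
    # Downward elimination on rows 0 .. m-1 (work of "processor 0").
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, m):
        den = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / den
        dp[i] = (d[i] - a[i] * dp[i - 1]) / den
    # Upward elimination on rows n-1 .. m+1 (work of "processor 1").
    aq[n - 1], dq[n - 1] = a[n - 1] / b[n - 1], d[n - 1] / b[n - 1]
    for i in range(n - 2, m, -1):
        den = b[i] - c[i] * aq[i + 1]
        aq[i] = a[i] / den
        dq[i] = (d[i] - c[i] * dq[i + 1]) / den
    # The two fronts meet at row m: a single scalar equation for x[m].
    x = [0.0] * n
    den = b[m] - a[m] * cp[m - 1] - c[m] * aq[m + 1]
    x[m] = (d[m] - a[m] * dp[m - 1] - c[m] * dq[m + 1]) / den
    # Back-substitution outwards from the middle; the halves are independent.
    for i in range(m - 1, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    for i in range(m + 1, n):
        x[i] = dq[i] - aq[i] * x[i - 1]
    return x
```

The two elimination loops touch disjoint rows, so they can run at the same time;
only the coefficients adjacent to row m have to be exchanged before the
back-substitution starts.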

As a result, the parallelized numerical method is algebraically identical to the 
sequential one. This is different from the iterative domain decomposition ap- 
proach, when the efficiency depends on convergence properties of the algorithm 
and can be violated by the splitting. 

The above method has demonstrated good parallelization efficiency [4,6].
However, the increased complexity of solving 3-diagonal systems limits the number
of processors to 4, at most 8. The natural way to overcome this limitation is
to extend the decomposition into the 2nd spatial direction (z). This would raise
the limit to 4 x 4 = 16 processors.

The parallelization procedure and data distribution for the direction z are
similar to those of the direction r for almost all steps of the algorithm. However,
FFTs in the direction z can't be efficiently parallelized because multiple
transmissions of all processed data are required. The way to reduce the number of
data transfers is to split the computational domain in the last spatial direction
(φ) and rearrange the data for this operation. Figure 2 illustrates this
rearrangement (blocked transposition) for 4 processors, in which 3/4 of all data
are involved in an exchange.

The following steps of the algorithm - FFT in the direction z, 3-diagonal
sweeps in the direction r, and inverse FFT in z - are performed on the rearranged
data independently in each processor. Finally, another transposition is required
in order to return the resulting data to the initial distribution. As a result, the
parallelized procedure for the Fourier method would look as follows:

FFT(φ), transposition, FFT(z), 3-diag(r), FFT(z), transposition, FFT(φ)
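The blocked transposition used in this procedure can be simulated on a small
array to see how much data has to cross between processors. The sketch below is
an illustration only: it models one 2D (z, φ) plane with nested Python lists, the
function name blocked_transpose is hypothetical, and the array sizes are assumed
to be divisible by the number of processors.

```python
def blocked_transpose(local_blocks):
    """local_blocks[p][iz][iphi]: processor p owns a contiguous block of z and
    the full phi range. Returns (new_blocks, moved), where new_blocks[p][iphi][iz]
    gives processor p a block of phi with the full z range, and moved counts
    the values that had to be sent to a different processor."""
    n_proc = len(local_blocks)
    nz_loc = len(local_blocks[0])
    nphi = len(local_blocks[0][0])
    nphi_loc = nphi // n_proc
    new_blocks = [[[None] * (nz_loc * n_proc) for _ in range(nphi_loc)]
                  for _ in range(n_proc)]
    moved = 0
    for src in range(n_proc):
        for iz in range(nz_loc):
            for iphi in range(nphi):
                dst = iphi // nphi_loc            # new owner of this phi index
                value = local_blocks[src][iz][iphi]
                new_blocks[dst][iphi % nphi_loc][src * nz_loc + iz] = value
                if dst != src:
                    moved += 1
    return new_blocks, moved


# 8 x 8 plane on 4 processors: each starts with 2 rows of z and all 8 phi values.
plane = [[100 * z + phi for phi in range(8)] for z in range(8)]
blocks = [plane[2 * p:2 * p + 2] for p in range(4)]
new_blocks, moved = blocked_transpose(blocks)
fraction_moved = moved / 64
```

For P processors the fraction of data that changes owner is (P - 1)/P, which
gives the 3/4 quoted above for P = 4.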

4 New Method for Solving Poisson Equation 

The described procedure requires an exchange of 3D data arrays between
processors, while for the other steps of the algorithm only 2D boundary planes must
be transferred. Since the speed of interprocessor communications of modern parallel
computers is much lower than their computational performance, this step would
involve long delays and dramatically reduce the efficiency of parallelization.

In order to lower the required amount of data transfer, a new method for
solving the pressure Poisson equation has been developed. The method is applied
to the 2D linear systems obtained after performing FFTs in the direction φ. It is
based on the FACR (Fourier analysis with cyclic reduction) approach [5] and
consists of 3 stages: cyclic reduction of the original matrix, solution of the
reduced linear system by the Fourier method, and substitution of the results into
the remaining equations.

The method of cyclic reduction is used for simplifying 3-diagonal and blocked
3-diagonal linear systems. One iteration of this method halves the number of
equations in a system in the following way. Consider three consecutive equations:

\[
\begin{aligned}
x_{i-2} + A x_{i-1} + x_i     &= y_{i-1} \\
x_{i-1} + A x_i     + x_{i+1} &= y_i     \\
x_i     + A x_{i+1} + x_{i+2} &= y_{i+1}
\end{aligned}
\]

If we multiply every second equation (the i-th in this case) by -A and add the two
adjacent equations to it, we obtain the reduced linear system:

\[
x_{i-2} + (2 - A^2)\, x_i + x_{i+2} = y_{i-1} - A y_i + y_{i+1}
\]

Substituting $A^{(1)} = 2 - A^2$ and $y_i^{(1)} = y_{i-1} - A y_i + y_{i+1}$, we
obtain a system of equations of the same type and can therefore employ the cyclic
reduction procedure again. After several iterations, the resulting system can be
solved by any convenient method, followed by backsubstitution steps in order to
find the remaining unknowns.
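The reduction and backsubstitution steps can be checked on the scalar case. The
following sketch is an illustration under simplifying assumptions that are not
part of the paper: A is a scalar rather than a 3-diagonal block, the number of
unknowns is n = 2^k - 1, the boundary values outside the system are zero, and the
reduction is repeated recursively down to a single equation instead of stopping
after a fixed number of steps.

```python
def cyclic_reduction(A, y):
    """Solve x[i-1] + A*x[i] + x[i+1] = y[i], i = 0..n-1, with x[-1] = x[n] = 0,
    for scalar A and n = 2**k - 1 unknowns."""
    n = len(y)
    if n == 1:
        return [y[0] / A]
    # Reduced system for the odd-indexed unknowns x[1], x[3], ...:
    #   x[i-2] + A1*x[i] + x[i+2] = y[i-1] - A*y[i] + y[i+1]
    A1 = 2 - A * A
    y1 = [y[i - 1] - A * y[i] + y[i + 1] for i in range(1, n, 2)]
    x_odd = cyclic_reduction(A1, y1)
    x = [0.0] * n
    x[1::2] = x_odd
    # Backsubstitution: each even-indexed equation now has one unknown left.
    for i in range(0, n, 2):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        x[i] = (y[i] - left - right) / A
    return x


x = cyclic_reduction(-4.0, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
```

In the blocked case each division by A becomes the solution of a 3-diagonal
system, and 2 - A*A becomes a matrix expression.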

In our case, a blocked 3-diagonal system is solved, where A is itself a 3-diagonal
matrix. As a result, the new matrices $A^{(1)}$ etc. are no longer 3-diagonal.
However, they can be factored into simple 3-diagonal matrices, and the resulting
systems can be solved by several repetitions of the 3-diagonal algorithm.
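One way to see why such a factorization exists is to treat the reduced matrices
as polynomials in A. The following derivation is offered as an illustration; the
text does not state which factorization the authors actually use, and the symbols
I (identity matrix), α and β are introduced here.

```latex
\begin{aligned}
A^{(1)} &= 2I - A^2 = (\sqrt{2}\,I - A)(\sqrt{2}\,I + A), \\
A^{(2)} &= 2I - \bigl(A^{(1)}\bigr)^2
         = \bigl(\sqrt{2}\,I - A^{(1)}\bigr)\bigl(\sqrt{2}\,I + A^{(1)}\bigr) \\
        &= \bigl(A^2 - (2-\sqrt{2})\,I\bigr)\bigl((2+\sqrt{2})\,I - A^2\bigr) \\
        &= (A - \alpha I)(A + \alpha I)(\beta I - A)(\beta I + A),
\qquad \alpha = \sqrt{2-\sqrt{2}}, \quad \beta = \sqrt{2+\sqrt{2}} .
\end{aligned}
```

Each factor differs from A only on the main diagonal, so it is 3-diagonal
whenever A is, and a system with $A^{(2)}$ can be solved by four successive
3-diagonal solves.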

Every iteration of the cyclic reduction increases the complexity of the numerical
algorithm and complicates the data exchange pattern. As a compromise,
the 2-step cyclic reduction scheme has been chosen, with a 4-fold reduction
of the matrix size and of the amount of data exchanges in the Fourier method.
Despite a slight increase of data transfers in other parts of the algorithm, the
resulting amount of transmissions is now at a reasonable level and no longer has
much influence on the efficiency of parallelization.

5 Some Technological Aspects of Parallelization 

Parallel computational code expressed in a high level language (Fortran in our
case) has a much more complicated structure than the sequential one. This
complexity arises in particular from the increased number of sorts of subdomains
with a variety of boundary conditions: external (physical) and internal (between
subdomains). In order to simplify the code flow, the alternating numbering scheme
is proposed. If, for example, the domain is split into 4 subdomains in some
direction, data elements (data points) in this direction are numbered in
alternating order (Fig. 3). Due to this, codes in every two adjacent nodes (0 and
1, 2 and 3) become more unified, and the total number of different boundary
conditions is reduced. More importantly, all data exchanges are performed
uniformly in all nodes. Additionally, the alternating numbering scheme naturally
corresponds to the two-way parallel partition method for solving 3-diagonal
linear systems.

Fig. 3. Standard (left) and alternating (right) numbering schemes

Another improvement of the algorithm concerns the solution of 3-diagonal
linear systems with constant coefficients, which occur in the discretized pressure
Poisson equation and some other cases. For this sort of system, the LU matrix
decomposition can be performed in advance, thus reducing the computational
work and eliminating data exchanges of matrix elements in the parallel solution.
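The saving from a precomputed decomposition can be illustrated on a sequential
solver. This is a minimal sketch, not the paper's parallel implementation: the
function names are hypothetical, the matrix is assumed to have the same three
values on its bands in every row, and no pivoting is performed.

```python
def factor_constant_tridiag(n, lower, diag, upper):
    """LU factors of the n x n 3-diagonal matrix with constant bands.
    l[i] is the multiplier applied to row i-1, u[i] is the i-th pivot."""
    l, u = [0.0] * n, [0.0] * n
    u[0] = diag
    for i in range(1, n):
        l[i] = lower / u[i - 1]
        u[i] = diag - l[i] * upper
    return l, u


def solve_with_factors(l, u, upper, rhs):
    """Forward and backward sweeps that reuse the stored factors, so only
    right-hand-side values are touched at solve time."""
    n = len(rhs)
    z = rhs[:]
    for i in range(1, n):
        z[i] -= l[i] * z[i - 1]
    x = [0.0] * n
    x[-1] = z[-1] / u[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (z[i] - upper * x[i + 1]) / u[i]
    return x


# Factor once, then solve for as many right-hand sides as needed.
l, u = factor_constant_tridiag(6, 1.0, -2.5, 1.0)
x1 = solve_with_factors(l, u, 1.0, [1.0, 0.0, 0.0, 0.0, 0.0, 1.0])
x2 = solve_with_factors(l, u, 1.0, [0.0, 2.0, 0.0, 0.0, 2.0, 0.0])
```

Because the factors depend only on the matrix, they can be computed once before
the time loop and kept locally, which is what removes the matrix elements from
the data exchanges.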

The next point is the choice of a communication library. The most standard one
is MPI. Unfortunately, some parallel systems may lack an MPI implementation
altogether, or may offer a more efficient option such as SHMEM, GM or MPL. For
this reason, a library-independent approach has been chosen, with a set of
intermediate data exchange routines used instead of direct MPI calls. All
library-specific calls are encapsulated within these routines. As a result, a
parallel application program becomes system-independent. In order to adapt to any
new communication protocol, only a small set of routines must be rewritten.
Sometimes there are incompatibilities between different implementations of the
same library, or some compiler problems, and the library-independent approach is
useful in this case as well.
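The structure of such an intermediate layer can be sketched as follows. The
paper's routines are written in Fortran and their names are not given, so
everything here is hypothetical: the Transport interface, the in-memory stand-in
backend, and the exchange_boundaries routine only illustrate the idea that
application code calls a small fixed set of routines and never a communication
library directly.

```python
from abc import ABC, abstractmethod
from collections import defaultdict, deque


class Transport(ABC):
    """Hypothetical minimal interface seen by the application code."""

    @abstractmethod
    def send(self, src, dst, data):
        ...

    @abstractmethod
    def recv(self, src, dst):
        ...


class InMemoryTransport(Transport):
    """Stand-in backend for testing; a backend for a real message-passing
    library would implement the same two methods with that library's calls."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def send(self, src, dst, data):
        self._queues[(src, dst)].append(list(data))

    def recv(self, src, dst):
        return self._queues[(src, dst)].popleft()


def exchange_boundaries(transport, rank, n_ranks, first_row, last_row):
    """Application-level routine, written once against Transport only."""
    if rank > 0:
        transport.send(rank, rank - 1, first_row)
    if rank < n_ranks - 1:
        transport.send(rank, rank + 1, last_row)


t = InMemoryTransport()
exchange_boundaries(t, rank=1, n_ranks=3, first_row=[1, 2], last_row=[3, 4])
```

Porting to a new communication protocol then means writing one more subclass,
while exchange_boundaries and the rest of the application stay unchanged.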

This approach also makes it possible to apply some specific optimizations of data
exchanges without modifying the application code, such as splitting the blocks to
be transferred into smaller parts, or regulating the duplex mode of transmission
in some way. Another thing necessary for parallel optimization is the renumbering
(remap) of allocated processor nodes, which can be important for a better fit
between the topology of a parallel computer (SMP nodes, 2D grids etc.) and the
structure of an algorithm.

Up to now, the intermediate communication routines have been adapted to
the following protocols: NX (Intel i860), Parix (Parsytec), PVM, MPL (IBM
SP2), SHMEM (Cray T3E, SGI) and MPI, the latter in different incompatible
implementations.

6 Comparison of Different Parallel Computers 

During the last years, many new parallel machines have appeared, including
a novel class: multiprocessor node (SMP) parallel computers and clusters.
Combining several processors in a single node with common shared memory
makes it possible to isolate traffic between neighbour processors within this
node, thus reducing internode communications. Also, the speed of intra-node
exchanges is usually several times higher due to "direct copy" transfer in memory.

Table 1. Characteristics of the analyzed parallel computers

Parallel platform    CPUs   CPU     Theor.   Real     Comm.     Comm.     Ratio
and interconnect     per    cache   peak     code     library   duplex    MB/s to
                     node   size    MFLOPS   MFLOPS             MB/s      MFLOPS
-----------------------------------------------------------------------------------
IBM SP2-375          16     8M      1500     297      MPL       80-175    0.27-0.59
  Colony switch
IBM SP2-120          1      128K    480      120      MPL       28        0.23
  SP switch
Alpha 21264-667      2      4M      1333     347      MPI       90        0.26
  Myrinet                                                       41-73     0.12-0.20
PC PIII-550          2      512K    550      84.5     MPI       39        0.46
  2 x Ethernet 100                                              5.8-10    0.07-0.12
SGI O2000-300        256    8M      600      125      MPI       30-45     0.24-0.36
  shared memory

The presented parallel code has been used for evaluating parallel performance 
of several computers, mainly of this new class, in order to reveal their commu- 
nication behaviour and applicability to this class of numerical problems. Ad- 
ditionally, an investigation of single processor performance and communication 
network characteristics has been performed. Main characteristics of all analyzed 
computers and some results of this investigation are presented in Table 1. 

The MFLOPS rates were measured by the single-processor version of the
presented code, for the problem size 70 MB (128 x 64 x 92). In order to represent
the real-life situation for SMP nodes, and to account for shared memory conflicts,
the appropriate number of copies of this program were run simultaneously.

The communication speed was measured by transferring large arrays of data
(32-64 KB) in duplex mode. Both intra-node and internode exchanges are
shown (top and bottom, respectively). When appropriate, measurements were
performed in two regimes: heavy, when all CPU pairs exchange simultaneously,
and light, when only one pair communicates without conflicts (shown as ranges).

Table 2 and Fig. 4 present the parallelization efficiency results for 2 problems: 
of the fixed size (70 MB), and the scalable one (70 MB per processor). 

For the fixed size problem, a superlinear speed-up can be seen for the SP2-375
due to its large L2 cache. When the subproblem's size becomes comparable with the
size of the cache, most data arrays fit into it entirely, and the computational
speed increases sharply, compensating (fully or partially) for the parallelization
overhead.

In general, these results illustrate a good correlation between the
communication-to-computation speed ratio and parallelization efficiency. However,
for a large number of processors (16, and sometimes 8) there is some unexpected
drop in efficiency for the scalable problem (IBM SP2-375, Alpha cluster, SGI O2000).

Table 2. Parallelization efficiency (%) for the fixed and scalable problems

                    fixed size problem             scalable problem
parallel platform   2      4      8      16        2      4      8      16
---------------------------------------------------------------------------
IBM SP2-375         98.9   98.0   105.5  102.3     96.2   86.4   79.6   71.5
IBM SP2-120         94.1   87.0   82.2   74.7      95.2   91.4   87.9   84.9
Alpha cluster       91.8   82.4   83.5   75.2      90.5   82.3   82.2   71.8
PC cluster          89.0   82.0   78.6   66.4      90.8   86.2   78.8   74.5
SGI O2000           -      -      -      -         94.0   85.3   77.8   57.0

Fig. 4. Efficiency (%) for the fixed (left) and scalable (right) problems



For the Alpha cluster, this can be explained by differences in the speed of
processor nodes, exceeding 5 % in some cases, and also by non-uniform (in time)
behaviour of the communication network. The IBM SP2-375 presumably suffers from
its multi-user environment, in which user processes migrate between SMP nodes
during their runs. The most unstable computer turned out to be the 256-processor
O2000. Due to its NUMA memory organization, even the performance of a
single-processor job would vary by 50 % or more between runs. For this reason, it
was impossible to obtain any reliable results for this machine for the fixed size
problem.

The classical and well-balanced IBM SP2-120 with single-processor nodes 
demonstrates very uniform and monotonic behaviour in all regimes. 

The most interesting observations have been obtained for the dual Ethernet
PC cluster. Although its communication network is less balanced than that of
the Alpha cluster (Table 1), the PC cluster demonstrates a comparable level of
parallelization efficiency due to better intra-node exchanges.

With a proper organization of parallel algorithms, this sort of system may
compete with Myrinet-based PC clusters [9], which have much more expensive
communication hardware. More advanced implementations of PC clusters, built upon
dual PIII and Athlon platforms (1 GHz or more) and interconnected by gigabit
Ethernet, look like promising and economical solutions for the coming years.

7 Conclusion

The method presented in the paper makes it possible to parallelize 3D CFD codes
for the simulation of incompressible flows in regular domains. Despite the
partially implicit nature of such codes and the relatively low communication speed
of modern computers' interconnects, this method ensures a reasonable level of
parallelization efficiency. The method follows the SPMD model and can be easily
adapted to different architectures. The comparative performance analysis of
several computers performed with the new code reveals their important
characteristics and illustrates the correlation between communication speed and
parallelization efficiency.

This work was partially supported by the program "Reseau de cooperation
universitaire et scientifique Franco-Germano-Russe" of the French Ministry of
National Education, and by the Russian Foundation for Basic Research (grant
RFBR-01-01-00745). The access to parallel computers was given by CINES,
France, and JSCC (Joint Supercomputer Center), Russia.



Abstract. As engineers are confronted with designing increasingly complex
systems composed of interconnected components of diverse nature, traditional 
methods of modeling and analysis become cumbersome and inefficient. In the 
paper we discuss one of the approaches to modeling and distributed simulation 
of hybrid (discrete/continuous) systems. We use hybrid state machines, where 
sets of algebraic-differential equations are assigned to states, to model complex 
interdependencies between discrete and continuous time behaviors. This frame- 
work is fully supported by UML-RT/Java tool AnyLogic developed at Experi- 
mental Object Technologies. We use High Level Architecture (HLA), a de- 
facto standard for distributed simulation, as a communication and synchro- 
nization media for distributed hybrid simulation components. Integration of 
simulations developed with AnyLogic into HLA is considered. 



1 Introduction 

A large class of systems being developed has both continuous time and discrete time 
behavior. In fact, any system that interacts with physical world falls in that class. 
Chemical, Automotive, Military, Aerospace are areas most frequently mentioned in 
this respect. To model such systems successfully and to get accurate and reliable 
results from simulation experiments one needs an executable language naturally de- 
scribing hybrid behavior, and a simulation engine capable of simulating discrete 
events interleaved with continuous time processes. Additional problems arise with 
simulating hybrid systems in a distributed environment. 

There are a number of tools, commercial and academic, capable of modeling and
simulating systems with mixed discrete and continuous behavior (so-called hybrid
systems); for a good survey we refer to [2] and [9]. We believe that the most
convenient way of modeling a hybrid system is to specify continuous behavior as a
set of algebraic-differential equations associated with a state of a state machine.
When a state changes as a result of some discrete event, the continuous behavior may
also change. In turn, a condition specified on continuously changing variables can
trigger a state machine transition, a so-called change event. State machines run
within objects that communicate in a discrete way, e.g. by message passing, as well
as by sharing continuous-time variables over unidirectional connections. Complex
hybrid system modeling may require distributed simulation because of system
complexity, performance and interoperability requirements, etc. A developed
simulation should interoperate with other components, possibly created with
different tools. This can be achieved by using an M&S standard, which also allows
creation of distributed simulations in which components run on different machines
and different platforms [12]. We believe that the High Level Architecture (HLA) for
Modeling and Simulation developed by the US DoD ([4], [5] and [6]) is the most
suitable for this purpose.

In this paper we present AnyLogic, a tool for modeling and simulation of hybrid
systems, and a way of integrating HLA support into the tool's simulation engine [3].
To demonstrate AnyLogic's ability to model and simulate hybrid systems, we present a
simple example, the two tanks system [9]. We examine the problems that arise in
distributed simulation of this system in AnyLogic using HLA.

The paper is organized as follows. Section 2 presents the AnyLogic tool and its
modeling language. A hybrid system example and its modeling in the AnyLogic
environment are described in Section 3. Section 4 gives an overview of the
integration of the AnyLogic simulation engine into HLA. A distributed model of the
two tanks system designed in HLA is also described there, together with the problems
of hybrid system simulation in a distributed environment. Section 5 concludes the
discussion.



2 AnyLogic and Its Modeling Language 

AnyLogic [1] architecture is shown in Fig. 1. 




Fig. 1. Architecture of AnyLogic Modeling and Simulation Environment 

The Windows-based Development Environment includes a graphical model Editor and a
Code Generator that maps the model into Java code. The model runs on any Java
platform on top of the AnyLogic Hybrid Engine. A running model exposes an interface
to control its execution and to retrieve information via a text-based protocol over
TCP/IP. That interface is used by the Viewer and Debugger, which run on the Java
platform as well. The model supports connection of multiple clients from arbitrary
(e.g. remote) locations.

We have chosen a subset of UML for Real Time as a modeling language, and ex- 
tended it to incorporate continuous behavior. The language supports two types of 
UML diagrams: collaboration diagrams and statechart (state machine) diagrams with 
some changes. In collaboration diagrams we have added unidirectional continuous 
connections between objects (capsules in UML-RT) and the corresponding interface 
elements - input and output variables. 

The main building block of a hybrid model is called an active object. The object
interface elements can be of two types: ports and variables. Objects interact by
passing messages through ports, or by exposing continuous time variables to one
another.

An object may encapsulate other objects, and so on to any depth. Encapsulated
objects can export ports and variables to the container interface, see Fig. 2.



[Figure: structure diagram; recoverable labels: Input, Relay Port, End Port]

Fig. 2. AnyLogic Structure Diagram extending UML-RT with continuous connections

An object may have multiple concurrent activities that share object local data and 
object interface. Activities can be created and destroyed at any moment of the model 
execution. An activity can be described by a Java function or by a (hybrid) statechart. 




Fig. 3. AnyLogic Hybrid Statechart 

In addition to standard UML attributes of states and transitions, in hybrid
statecharts one can associate a set of differential and algebraic equations with a
simple and/or composite state of a statechart, and one can also specify a condition
over continuously changing variables as a trigger of a transition. The currently
active set of equations and triggers is defined by the current simple state and all
its containers.

The example hybrid statechart in Fig. 3 is a simple model of an object that acceler- 
ates vertically up until it reaches the speed of Vmax, and then falls under the impact 
of gravity until it touches the ground (y <= 0), where it ceases to exist. 
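
The behavior of such a statechart can be illustrated with a minimal Python sketch. It is not AnyLogic code: the state names, the numeric constants, and the explicit Euler integrator are assumptions made only for this illustration. Each state carries its own equation, and a condition on the continuous variables acts as the transition trigger.

```python
# Minimal hybrid statechart sketch: equations are attached to states, and
# change events (conditions on continuous variables) trigger transitions.
G = 9.81        # gravity, m/s^2 (assumed value)
ACCEL = 20.0    # upward acceleration in state "Accelerate" (assumed value)
V_MAX = 50.0    # speed at which acceleration stops (assumed value)

# Right-hand side of dv/dt for each state; dy/dt = v in both states.
EQUATIONS = {
    "Accelerate": lambda y, v: ACCEL,
    "Fall": lambda y, v: -G,
}

# For each state: (change-event predicate, target state).
TRANSITIONS = {
    "Accelerate": (lambda y, v: v >= V_MAX, "Fall"),
    "Fall": (lambda y, v: y <= 0.0, None),   # None: the object ceases to exist
}


def simulate(dt=0.001, t_end=100.0):
    """Integrate with explicit Euler, testing the change event after each step."""
    state, t, y, v = "Accelerate", 0.0, 0.0, 0.0
    trace = [(t, state)]
    while state is not None and t < t_end:
        dv = EQUATIONS[state](y, v)
        y, v, t = y + v * dt, v + dv * dt, t + dt
        predicate, target = TRANSITIONS[state]
        if predicate(y, v):
            state = target
            trace.append((t, state))
    return trace, y, v
```

Calling `simulate()` returns the sequence of visited states with their entry times; the accuracy of the detected transition times is limited by the step `dt` in this simple scheme.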



3 Hybrid System Example 

Consider a system consisting of two tanks and a controller (Fig. 4). Three valves
control water injection into tank 1 (v1), water flow from tank 1 to tank 2 (v2), and
water flow from tank 2 out of the system (v3). The controller tracks the water level
in both tanks (h1 and h2) and generates commands to open or close valves. The main
task is to avoid droughts or overflows of tank 2. (An AnyLogic demo with this and
other examples is available from http://www.xjtek.com.)




Fig. 4. Two tanks system example 

As we can see from the system description, there are two components: the two tanks
and the controller. The structure diagram of the two tanks component of the AnyLogic
system model is presented in Fig. 5. Output variables h1 and h2 are used to expose
the water levels in tanks 1 and 2 respectively.

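$$\frac{dh_1}{dt} = \frac{v_{In} - v_{12}}{S_1}, \qquad \frac{dh_2}{dt} = \frac{v_{12} - v_{Out}}{S_2}, \qquad S_1 = \pi (d_1/2)^2, \qquad S_2 = \pi (d_2/2)^2 \tag{1}$$

$$h = 0.39, \quad d_1 = 0.12, \quad d_2 = 0.05, \quad v_{In} = 400/1000/3600, \quad L^* = 0.9, \quad l = 0.3; \quad \text{heights of both tanks are } 1.0 \tag{2}$$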
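
As a small worked illustration of the level equations, the following Python sketch evaluates the cross-sections and the right-hand sides for given flows. The flows through the valves (`v12`, `v_out`) are passed in as plain numbers because their dependence on the valve state is defined in the statecharts and is not reproduced here; the function name and the interpretation of the parameters in SI units are assumptions of this sketch.

```python
import math

# Parameters as listed in the text; SI units are an assumption of this sketch.
D1, D2 = 0.12, 0.05              # diameters entering S1 and S2
V_IN = 400 / 1000 / 3600         # inflow through the input valve
S1 = math.pi * (D1 / 2) ** 2     # cross-section used for tank 1
S2 = math.pi * (D2 / 2) ** 2     # cross-section used for tank 2


def level_rates(v_in, v12, v_out):
    """Right-hand sides of the level equations: returns (dh1/dt, dh2/dt)."""
    return (v_in - v12) / S1, (v12 - v_out) / S2


# With the connecting and output valves closed, only tank 1 fills.
dh1, dh2 = level_rates(V_IN, 0.0, 0.0)
```

The volume balance `S1 * dh1 + S2 * dh2 = v_in - v_out` holds for any choice of the intermediate flow `v12`, which is a quick consistency check on the equations.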
Ports vXOn and vXOff are used to receive commands for the appropriate valves.
Variables k1, k2, p1 and p2 are used in the hybrid statecharts trackV1 and trackV2
to model water flow through valves v2 and v3 while they are in transit from the
opened to the closed state and vice versa.



[Figure: TwoTanks structure diagram; recoverable labels: activities
animationUpdate, trackVInput, trackV1, trackV2, trackH; output variables h1, h2;
ports vInputOn, vInputOff, v1On, v1Off]

Fig. 5. Two tanks component structure diagram.



Fig. 6. Overall system structure



Fig. 7 presents the hybrid statechart trackV1. The controller component consists of
one statechart implementing its logic (initially fill the tanks and then track the
water level in tank 2), opening valve v3 when h2 goes below l and closing it when h2
rises above L*. The overall system structure diagram is presented in Fig. 6.




Fig. 7. TrackV1 activity (hybrid statechart)




[Figure: plot of root.twoTanks.h1 and root.twoTanks.h2]

Fig. 8. Simulation results of AnyLogic model of two tanks problem.

Note the continuous variable connection on h2 between components. Simulation
results of the system with equations (1) and parameters (2) are presented in Fig. 8.



4 Distributed Simulation with HLA Support 

The High Level Architecture (HLA) is a standard framework that supports simula- 
tions composed of different distributed simulation components. The HLA was devel- 
oped by the Defense Modeling and Simulation Office (DMSO) of the Department of 
Defense (DoD) to meet the needs of defense-related projects, but it is now increas- 
ingly being used in other application areas [11]. The primary goal of such architecture 
is to facilitate simulation interoperability and reuse across a broad range of applica- 
tions [7,8]. Recent adoption of the HLA as an IEEE standard will strengthen positions 
of this architecture among other modeling and simulation standards. There are exam- 
ples of the successful creation of distributed simulations composed of components 
developed using different tools [10]. 

The HLA follows a framework approach and is defined by three major elements: 

- Rules [4] govern the behavior of the overall distributed simulation (Federation)
  and its members (Federates);

- An Interface Specification [5] prescribes the interface between each federate and
  the Runtime Infrastructure (RTI), which provides communication and coordination
  services to the federates;

- An Object Model Template [6] defines the way federations and federates
  have to be documented (using the Federation Object Model and the Simulation
  Object Model, respectively). Federations can be viewed as a contract between
  federates on how a common federation execution is intended to be run.

In simulation, HLA plays a role similar to the one that CORBA, COM+, etc. play in
object-oriented distributed software development.



4.1 Integrating Simulation Engine of AnyLogic and HLA

Integrating HLA support into AnyLogic will give the user the possibility of rapidly
creating prototypes of component simulations (federates) and developing distributed
simulations (federations) using the convenient and powerful graphical environment
of AnyLogic. When a prototype developed in AnyLogic proves its ability to deal with
the problem it is intended to solve, one or more of the federates could be
re-implemented using, for example, the raw HLA interface in one of the
high-performance languages such as C++. We believe that such an approach to
federation development could reduce the time and cost of simulation development and
avoid many errors in the early phases of the development process.

HLA integration in AnyLogic requires careful consideration. The most difficult
problems arise in distributed simulation because of the hybrid nature of the
simulated system components.

AnyLogic simulation algorithm could be represented as follows: 
1. If there are current events (events that are scheduled to occur at the current
   model time), then randomly select one and execute it. This step is called the
   event step. The properties of the event step are:

   - No model time elapses
   - Some actions defined within the model may be executed
   - As a result, the state of the model may change

2. Otherwise (no current events scheduled), the model time could be advanced to the
   time of the next discrete event scheduled (if any). The properties of this time
   step are:

   - The model time progresses
   - The discrete state of the model remains unchanged
   - Active algebraic-differential equations are solved numerically and the
     variables are changed correspondingly
   - Awaited change events (predicates on continuously changing variables which
     could change the discrete state of the system) are tested for occurrence. If a
     change event occurs, it is scheduled as a current event and the algorithm
     proceeds to Step 1.
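
The two-step loop above can be sketched in Python as follows. This is only an illustration of the control flow, not the AnyLogic engine: the dictionary-based model layout, the key names, and the explicit Euler integration with a fixed step are all assumptions of this sketch.

```python
import random


def simulate(model, t_end, dt=0.01):
    """Run the event-step / time-step loop until model time reaches t_end.

    model is a dict with (hypothetical) keys:
      'x'      continuous variable
      'rate'   function(model) -> dx/dt in the current discrete state
      'events' list of (time, action) pairs; action(model) changes the state
      'change' list of (predicate(model), action(model)) change events
    """
    t, log = 0.0, []
    while t < t_end:
        current = [e for e in model["events"] if e[0] <= t]
        if current:
            # Event step: pick one current event at random; no time elapses.
            event = random.choice(current)
            model["events"].remove(event)
            event[1](model)
            log.append(t)
            continue
        # Time step: advance toward the next scheduled discrete event (if any).
        t_next = min([e[0] for e in model["events"]] + [t_end])
        while t < t_next:
            h = min(dt, t_next - t)
            model["x"] += model["rate"](model) * h      # explicit Euler
            t = t_next if h < dt else t + h
            fired = [c for c in model["change"] if c[0](model)]
            if fired:
                # Change events become current events; return to the event step.
                model["events"].extend((t, c[1]) for c in fired)
                break
    return log


# Toy model: x fills at rate 1 until the change event x >= 1 fires;
# a discrete event at t = 2 switches the model off.
model = {
    "x": 0.0,
    "mode": "fill",
    "rate": lambda m: 1.0 if m["mode"] == "fill" else 0.0,
    "events": [(2.0, lambda m: m.update(mode="off"))],
    "change": [(lambda m: m["mode"] == "fill" and m["x"] >= 1.0,
                lambda m: m.update(mode="idle"))],
}
log = simulate(model, t_end=3.0)
```

In this sketch the action of a change event must make its own predicate false; otherwise the same event would be rescheduled at every step.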

If we wish to build a time regulating and/or time constrained federate with
AnyLogic, we should modify the simulation engine to allow coordinated time
advancement of all distributed simulation participants. The most general technique
to achieve such coordination is to use a zero lookahead value (for time regulating
federates) and the Next Event Request Available (NERA) HLA Time Management service
call. The NERA(t) service allows delivery of all queued receive-order (RO) messages.
It grants time advancement to the time t (if no more time-stamp-order (TSO) messages
will be delivered with a time stamp less than t) or to the time t1 < t, where t1 is
the lowest time stamp of all scheduled TSO messages; it then delivers this TSO
message to the federate. Using the NERA service call allows sending and/or receiving
additional TSO messages scheduled at the current time and allows seamless
integration of the local AnyLogic and HLA simulation engines.

AnyLogic simulation algorithm with HLA support could be represented as follows: 

1. t = t0. Detect discrete change events;

2. if there are current events, then
       Choose one and execute it
       NextEventRequestAvailable(t0)
       Awaiting TimeAdvanceGrant(t0) callback
       Goto 1

3. t1 = findNextContinuousEvent(t0, min{Tnext, Tmaxstep})

4. NextEventRequestAvailable(t1)

5. Awaiting TimeAdvanceGrant(t2) callback (t2 <= t1)

6. t0 = t2

7. Goto 1



Here Tnext is the time of the next event scheduled, and Tmaxstep is a constant.

AnyLogic provides a user-accessible service for determining the time stamp of the
next continuous or discrete event. Extending the AnyLogic simulation engine with
such services allows integration of HLA support as an add-on package.
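
The coordinated loop can be sketched in Python against a mock runtime infrastructure. `MockRTI` is a hypothetical stand-in written for this illustration and is not a binding to any real RTI; the grant is returned synchronously instead of through a callback, the continuous part is reduced to a single known crossing time, and Tmaxstep is interpreted here as a bound on the step length, which is an assumption.

```python
class MockRTI:
    """Stand-in for an HLA runtime infrastructure; NOT a real RTI binding."""

    def __init__(self, tso_messages):
        self.pending = sorted(tso_messages)      # (time stamp, payload) pairs

    def next_event_request_available(self, t):
        """Grant t, or the earliest pending TSO time stamp if it is <= t."""
        if self.pending and self.pending[0][0] <= t:
            stamp, payload = self.pending.pop(0)
            return stamp, [(stamp, payload)]
        return t, []


def run_federate(rti, local_events, crossing_time, t_end, t_max_step=1.0):
    """local_events: list of (time, name); crossing_time: time at which the
    single change event of this toy model becomes true."""
    t0, fired, log = 0.0, False, []
    events = list(local_events)
    while t0 < t_end:
        # Step 1: detect change events at the current time.
        if not fired and t0 >= crossing_time:
            events.append((t0, "change event"))
            fired = True
        # Step 2: execute one current event, then request time t0 from the RTI.
        current = [e for e in events if e[0] <= t0]
        if current:
            event = current[0]
            events.remove(event)
            log.append((t0, event[1]))
            _, delivered = rti.next_event_request_available(t0)
            events.extend(delivered)
            continue
        # Step 3: time of the next local event, continuous or discrete.
        t_next = min([e[0] for e in events] + [t_end])
        horizon = min(t_next, t0 + t_max_step)
        t1 = min(horizon, crossing_time) if not fired else horizon
        # Steps 4-6: request t1; the granted time t2 may be earlier.
        t2, delivered = rti.next_event_request_available(t1)
        events.extend(delivered)
        t0 = t2
    return log


rti = MockRTI([(1.2, "open valve")])
log = run_federate(rti, [(3.0, "timeout")], crossing_time=2.5, t_end=4.0)
```

In the toy run the remote message stamped 1.2 interrupts a requested advance to 2.0, so the federate processes it before its own change event at 2.5 and its local event at 3.0.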



4.2 Distributed Simulation of Two Tanks Problem with AnyLogic 

The HLA support module for AnyLogic, which is currently under development, allowed
us to build a distributed simulation of the two tanks system model described above.
The system has been represented as one object class TwoTanksSystem with attributes
Tank1Level and Tank2Level. The interaction class ValveState with parameters Valve
and IsOpen allows federation participants to change valve states of the instance of
the system. The HLA federation consists of three federates: the two tanks simulator
federate, the controller federate, and a viewer federate providing visualization of
the process. The two tanks simulator federate publishes object class TwoTanksSystem
with both attributes, creates and registers one instance of that class, and updates
its attributes. It also subscribes to the ValveState interaction and translates it
into messages to the appropriate ports (vXOn and vXOff). The controller federate
subscribes to object class TwoTanksSystem with both attributes and later discovers
the object instance created by the two tanks simulator. It also publishes
interaction class ValveState to be able to send interactions of this class. AnyLogic
models of the system components have been wrapped by additional ActiveObjects
responsible for registration or discovery of the instance of the appropriate HLA
object class, updating instance attribute values (periodically), and translating
commands previously sent via ports to and from HLA interactions. An additional
ActiveObject called HLATimeAdvancer has been added to every federate to synchronize
the local simulation engine time with federation time by requesting time
advancements from the HLA RTI as described in the previous section.

Distributed simulation of the model shows overflows of tank 2. Because the model
logic has been left unchanged, the source of the problem is the breaking of
connections between the two tanks and the controller (Fig. 6). Connections between
ports of the components have been represented as interactions between distributed
components. Sending or receiving a message to/from a port is a discrete event, thus
no information is lost by such a representation. But transmission of the continuous
time variable h2 only at discrete moments of time (periodic updates), with an update
period greater than some Δt, will not allow the controller to react properly to the
change in water level. The previous update may indicate a normal level (below L*),
but the next one may show a very high level or even overflow. This is an instance of
a more general sensitivity problem.
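
The effect of a too-long update period can be shown with a few lines of Python. The linear rise of the level and all numeric values are assumptions chosen only to make the arithmetic easy to follow; they are not taken from the two tanks model.

```python
def detection_delay(rate, threshold, period, t_end=100.0):
    """Level rises linearly, h(t) = rate * t, but an observer only sees h at
    multiples of `period`. Returns (true crossing time, detection time)."""
    t_cross = threshold / rate
    k = 0
    while k * period <= t_end:
        t = k * period
        if rate * t >= threshold:
            return t_cross, t
        k += 1
    return t_cross, None


# Assumed numbers: tank height 1.0, alarm threshold 0.9, level rising by 0.05 per
# second, updates every 4 seconds.
t_cross, t_detected = detection_delay(rate=0.05, threshold=0.9, period=4.0)
level_at_detection = 0.05 * t_detected
```

With these numbers the update at t = 16 still reports a level of 0.8, and the next update at t = 20 already reports a full tank, although the threshold was crossed at t = 18.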

So, we face the problem of exposing a continuous time variable in distributed
simulation of a hybrid system and detecting a condition defined on it in another
distributed component.

There are no significant conceptual problems with building distributed simulations
of discrete event systems according to system state updates. The situation changes
when one or more components have continuous time or mixed (hybrid) behavior, which
they want to expose to other components.

The problem is to represent a hybrid system as a discrete event system at the level
of distributed component interactions.

Three general approaches could be proposed: value polling, sampling along the time
or value axis, and an approach which gives one component the ability to define a
predicate on a variable, which is to be evaluated locally at another component,
along with notification when such an event occurs.

The first two approaches are quite obvious. Their disadvantage is that they
cannot guarantee correct detection of conditions defined over continuously changing
interface variables. Below we propose an approach, called Remote Predicate
Evaluation, which can cope with those difficulties.



4.3 Detecting Conditions Defined over Continuously Changing Interface Variables

Polling and sampling update methods are good enough when we need to monitor 
behavior of components, e.g. for building external viewers, statistic collecting, track- 
ing objects, and other situation awareness needs. However, these update methods are 
not very good for detecting conditions defined over continuously changing interface 
variables (like h2 in the distributed two tanks system). 

There are situations when we are interested in the value of a predicate (condition)
defined on a continuous time variable. This, for example, may affect the discrete
state of the system or trigger some actions associated with this event in other
components. Remote Predicate Evaluation (RPE) is a method in which such a predicate
is checked locally within the component that exposes the variable(s), while it
solves the algebraic-differential equations. It can provide the required accuracy in
determining the moment of time when this event occurs. Besides, it allows the
distributed model designer to lower the probability of the sensitivity problem
appearing and minimizes the overhead caused by frequent variable value updates (only
the fact of event detection is announced to other interested components). A finite
amount of information is required to transfer both a predicate and a notification
over the network.
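
A minimal Python sketch of the idea follows. The class and method names are hypothetical, the level is assumed to rise at a constant positive rate, and the predicate is assumed to be set before the level exceeds the threshold; the point is only that the component owning the variable tests the predicate inside its own integration loop and reports a single event time.

```python
class TankFederate:
    """Owns the continuous variable h and evaluates a remotely defined predicate."""

    def __init__(self, rate, dt=0.01):
        self.h, self.t, self.rate, self.dt = 0.0, 0.0, rate, dt
        self.threshold = None       # predicate parameter set by a remote component
        self.notified = False

    def set_predicate(self, threshold):
        """Called when a (hypothetical) message carrying the threshold arrives."""
        self.threshold = threshold
        self.notified = False

    def advance(self, t_target):
        """Integrate up to t_target; return the event time if h > threshold fires."""
        while self.t < t_target:
            h_new = self.h + self.rate * self.dt
            if (self.threshold is not None and not self.notified
                    and h_new > self.threshold):
                # Locate the crossing inside the step by linear interpolation.
                frac = (self.threshold - self.h) / (h_new - self.h)
                self.t += frac * self.dt
                self.h = self.threshold
                self.notified = True
                return self.t
            self.h, self.t = h_new, self.t + self.dt
        return None


federate = TankFederate(rate=0.05)
federate.set_predicate(0.9)          # the remote component defines h > 0.9
t_event = federate.advance(20.0)     # time reported back to the remote component
```

Only the threshold and the resulting notification cross the component boundary, and the reported time is accurate to well below the integration step, independent of any update period.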

It could be recommended to design distributed simulation in the way that compo- 
nents “encapsulate” their continuous behavior, exposing continuous time variables 
(attributes) only for the needs of situation awareness, visualization and remote statis- 
tic collection at relatively low rate. Detection of all required events identified during 
simulation (federation) design and development is then performed inside the compo- 
nent and the corresponding notification is sent to all interested components. 

Sometimes, however, simulation components model devices with analog output, 
which is continuous by its nature (e.g., electrical current or voltage interface). A 
model designer may not know a priori which conditions will be interesting for com- 
ponents connected to this device during distributed simulation execution. In this case 
a mechanism for dynamic creation and modification of predicates on output variables 
can be implemented. 

If the designer has a priori knowledge of the form of the predicate, she could
parameterize it and allow other components to change the parameters during
simulation execution, tuning this component for their needs. In HLA this effect
could be achieved, for example, by declaring attributes with transferable ownership
representing predicate parameters. Alternatively, it could be implemented using a
specialized interaction exchange protocol.

The latter method was implemented in the prototype HLA support add-on to AnyLogic
for distributed simulation of the above-mentioned two tanks problem. The controller
tracks the value of h2, and when it reaches some dangerously high level L*, it
commands the appropriate valve to open. In other words, the controller component
defines a predicate on the continuous variable h2 in the form h2 > L*. Since h2 is
modeled in another component, the h2 update method can directly affect the time
delay before the open valve command is issued and thus can lead to tank 2 overflow
(this is exactly what we observed during distributed simulation using sampling). The
same system simulated as a single AnyLogic model does not show such overflow. Here
we can see that distributed simulation of a hybrid system can show wrong results
just because of the continuous variable update delay.

The solution to this problem is to allow the controller federate to define the L*
value for the two tanks federate. We have done this by defining an HLA interaction
with a parameter specifying the L* value. After receiving the interaction, the two
tanks federate changes the parameter of the predicate on h2 (h2 > L*) and evaluates
it while solving the system of differential equations. After the predicate becomes
true (an event is detected), the federate updates the values of h1 and h2, and the
controller detects this event with the minimal possible error. As the predicate is
evaluated “remotely” by the federate simulating the continuous time variable, we
call this method Remote Predicate Evaluation. It showed its ability to deal with the
described problem for this particular example of distributed simulation.

RPE has several obvious limitations. For example, it cannot help if we have a
predicate on more than one continuous time variable simulated by different
distributed components. For this case, a revision of the model partitioning into
distributed components could be advised. Obviously, subcomponents tightly coupled by
continuous variables should be placed in the same distributed component.



5 Conclusion 

The hybrid state machine approach implemented in the AnyLogic modeling and
simulation environment is a powerful and convenient formalism for describing the
behavior of real world systems. AnyLogic itself is a very flexible tool; it is
essentially an environment for programming in Java, with support for visual
specification of the modeled system in terms of a simulation class library. This
property of AnyLogic makes it relatively easy to develop HLA support extensions and
to enable components created with this tool to participate in distributed
simulations.

There is a single common standard of distributed simulation in the military domain,
but there is no such standard in the civil domain. Adoption of HLA as an IEEE
standard will improve the situation. It can help with interoperability and reuse of
simulations created with different tools. HLA is suitable for discrete event
systems, but problems arise when modeling distributed hybrid systems with HLA.
Distributed execution of components connected by continuous time variables, and
detection of events related to them, may lead models to produce results which differ
drastically from those of a local simulation of the same system. One example of such
a problem has been demonstrated, and a solution named Remote Predicate Evaluation
has been proposed.
Abstract. The Time Domain Reflectometry probe is a new technique applied
to moisture measurement. It is a wave guide driven into the ground. The wave
crosses a medium with variable electric properties. We develop a model based on
the solution of Maxwell’s equations which allows us to determine the
electromagnetic field and the energy density. The problem of wave front
propagation has a very large CPU cost, so we develop a parallel
computing approach based on C++ Object-Oriented Programming, the Finite
Element Method and a selected data technique. We combine SIMD
technology with the MPI C++ library for the software implementation. High
computing performance is obtained.



1 Introduction 

Water resource management for plants is becoming an increasingly acute problem.
This is associated with pollution phenomena usually caused by fertilizers. It is
therefore essential to have a precise idea of the soil’s moisture content, both at
surface and underground levels. Various techniques have been used to this end, e.g.
the cumbersome technique of sampling. The latest generation of instruments [1] under
development is based on Time Domain Reflectometry (TDR). These instruments
consist of two or three parallel metal rods driven into the ground to a depth of a
little over three feet (about 1 m). An electrical impulse is applied to the end
above the ground surface. The impulse propagates along the rods. Measurement and
processing of the signal obtained by reflection should theoretically allow the
determination of local moisture levels. In this paper we present a parallel
computing method for the numerical study of an electromagnetic echography of the
ground. To this end we use a model based on Maxwell’s equations with a Finite
Element approach.



2 General Presentation of the Model 

The wave guide (Fig. 1) is represented by two parallel electrodes whose electric
conductivity is infinite. The free space around the electrodes can be an ohmic
conductor. The electrical impulse is applied to the end above the ground surface,
and we study the Transverse Magnetic (TM) mode in the wave guide.

We formulate Maxwell’s equations with the usual notations (1). For a
numerical model we prefer to use non-dimensional equations. The non-dimensional
Maxwell’s equations in (E, H) involve a non-dimensional parameter Rm and
become [2]:

$$[\mu_r]\,\frac{\partial H}{\partial t} = -\operatorname{curl} E, \qquad \operatorname{div}([\mu_r] H) = 0,$$
$$[\varepsilon_r]\,\frac{\partial E}{\partial t} = \operatorname{curl} H - R_m\,[\gamma_r]\,E, \qquad \operatorname{div}([\varepsilon_r] E) = 0 \tag{1}$$

3 Finite Element Formulation 

The 2D model is presented in Figure 1. We discretize with triangular linear elements.
We have 13,985 triangular elements and 7,300 nodes. 29,200 differential equations
are generated by a Finite Element process [3].




Fig. 1. Geometrical model and mesh of the medium. 



3.1 The Code FAFEMO and the Automatic Multigrid System (AMS) 

We use efficient C++ Object-Oriented Programming for the Finite Element code
called FAFEMO (Fast Adaptive Finite Element Modular Object) [4]. This technology
allows the implementation of very compact solvers (29 Kb, 700 lines). In this
context, our numerical computation uses a technique called the AMS. For each time
step, the determination of the computational area and the selection of the elements
dedicated to each processor are performed on the full grid [5].

3.2 Grid Optimization 

The AMS expert system should be built for each problem. Mathematical, numerical
and physical considerations can be used. Before the parallelization of the algorithm
the finite element grid should be optimized. In our example the celerity c of the
wave is known. Only the active elements are selected, according to the front
position. This defines a time-dependent active grid. In our case the number of
active degrees of freedom and the number of elements increase in time.

The AMS thus reduces the size of the differential system before the parallel
technology is applied.
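
One possible selection rule is sketched below in Python. The source only states that active elements are chosen according to the front position; the specific criterion used here (element centroid within the distance c·t of the excitation point, plus an optional margin), the function name, and the toy grid are assumptions of this sketch.

```python
import math


def active_elements(centroids, source, c, t, margin=0.0):
    """Indices of elements whose centroid lies within c*t + margin of the source."""
    radius = c * t + margin
    return [i for i, (y, z) in enumerate(centroids)
            if math.hypot(y - source[0], z - source[1]) <= radius]


# Toy 10 x 10 grid of element centroids with unit spacing (not the mesh of Fig. 1).
centroids = [(float(i), float(j)) for i in range(10) for j in range(10)]
counts = [len(active_elements(centroids, (0.0, 0.0), c=1.0, t=t))
          for t in (0.0, 1.0, 3.0, 20.0)]
```

The number of selected elements grows with time until the whole grid is active, which mirrors the increasing size of the differential system described above.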



4 Numerical Resolution 

After a classical assembly operation, where the index (G) denotes the global values,
the differential system of reduced size is as follows:

$$[M^{(G)}]\left\{\frac{dU^{(G)}}{dt}\right\} + [K^{(G)}]\{U^{(G)}\} = \{F^{(G)}\} \tag{2}$$

Under these conditions we can test the semi-implicit method. We use a matrix-free
technique [5]: the mass matrix and the stiffness matrix are never built. We observe
a high level of CPU performance, and storage is needed for vectors only.




4.1 Principle of Parallelization 

The principal CPU cost corresponds to the computation of the elementary matrices
(me, ke, fe) and, secondarily, to the time step update. In the example of an
unsteady problem, the analytical discretization of the problem with the Finite
Element Method gives the following scalar product [3]:

$$\sum_{NE} \{\delta u_e\}^T \left( [m_e]\left\{\frac{du_e}{dt}\right\} + [k_e]\{u_e\} - \{f_e\} \right) = 0 \quad \text{with } NE = 1 \ldots ne$$

If p is the number of processors, we select a list of elements Nk by the AMS expert
system:

$$\bigcup_{k=1}^{p} N_k = NE \quad \text{and} \quad N_i \cap N_j = \emptyset \ \text{for } i \neq j \tag{4}$$

Each elementary matrix can be assembled into a global matrix by a classical Finite
Element process [3]. With p processors the dispatching of the elements in the
lists is:

$$\text{processor } j: \ \text{list } N_j, \qquad \sum_{N_j} [k_e] = [K_j]$$
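
The dispatching can be illustrated with a short Python sketch. The contiguous split of the element numbering and the lumped (per-node) elementary contributions are assumptions made for this illustration; the actual lists are produced by the AMS expert system. Each "processor" assembles only the elements of its own list, and the partial results are summed at the end.

```python
def partition(ne, p):
    """Split element indices 0..ne-1 into p disjoint contiguous lists of
    near-equal size, so that their union is the full element set."""
    base, extra = divmod(ne, p)
    lists, start = [], 0
    for k in range(p):
        size = base + (1 if k < extra else 0)
        lists.append(list(range(start, start + size)))
        start += size
    return lists


def assemble(elements, element_ids, n_nodes):
    """Sum elementary contributions (one value per element node) into a global vector."""
    glob = [0.0] * n_nodes
    for e in element_ids:
        nodes, values = elements[e]
        for n, v in zip(nodes, values):
            glob[n] += v
    return glob


# Toy mesh: 4 triangles on 5 nodes, unit contribution per element node.
elements = [((0, 1, 2), (1.0, 1.0, 1.0)), ((1, 2, 3), (1.0, 1.0, 1.0)),
            ((2, 3, 4), (1.0, 1.0, 1.0)), ((0, 2, 4), (1.0, 1.0, 1.0))]
lists = partition(len(elements), 2)
partials = [assemble(elements, ids, 5) for ids in lists]   # one per processor
total = [sum(values) for values in zip(*partials)]         # reduction step
```

Because the lists are disjoint and cover all elements, the reduced vector equals the result of a sequential assembly, and only this reduction requires communication.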



4.2 Parallel Algorithm 

In this case Bernstein’s conditions are satisfied, so we have correct load
balancing if the list of elements is similar for each processor. The
definition of the arrays depends on the technology. We use SIMD (Single Instruction
Multiple Data) associated with the MPI (Message Passing Interface) C++ library.
Communication between the processors takes place only at the end of the time step.
Only the global diagonal mass matrix is constructed before the solution is updated.
Each processor builds its part of the differential system, and the algorithm below
then updates the solution {U}. A parallel semi-implicit algorithm is used:

$$
\begin{array}{l}
\textbf{while} \\
\quad \{\Delta U_n^{i}\} = \Delta t_n\,[M]^{-1}\,\{\Psi_n(U_n + \alpha\,\Delta U_n^{i-1},\ t_n + \alpha\,\Delta t_n)\}, \quad i = 1, 2, \ldots \ \text{until } \|\Delta U_n^{i} - \Delta U_n^{i-1}\| < \text{tolerance} \\
\quad t_{n+1} = t_n + \Delta t_n \\
\textbf{end while}
\end{array}
$$

where α is the upward time-parameter. If α < 0.5, a stability condition is
required [6].
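
The inner iteration of one time step can be sketched in Python as follows. This is a serial illustration under simplifying assumptions: the mass matrix is diagonal and stored as a vector, the right-hand side is supplied as a function `residual` that does not depend explicitly on time, and the function and parameter names are hypothetical.

```python
def semi_implicit_step(u, dt, m_diag, residual, alpha=0.5, tol=1e-12, max_iter=100):
    """One time step: iterate dU_i = dt * M^-1 * R(U + alpha * dU_{i-1})
    until two successive increments differ by less than tol."""
    du = [0.0] * len(u)
    for _ in range(max_iter):
        trial = [ui + alpha * dui for ui, dui in zip(u, du)]
        r = residual(trial)                    # built element by element in practice
        du_new = [dt * ri / mi for ri, mi in zip(r, m_diag)]
        change = max(abs(a - b) for a, b in zip(du_new, du))
        du = du_new
        if change < tol:
            break
    return [ui + dui for ui, dui in zip(u, du)]


# Scalar test problem du/dt = -2u with unit mass.
u1 = semi_implicit_step([1.0], dt=0.1, m_diag=[1.0],
                        residual=lambda U: [-2.0 * U[0]])
```

For this linear test problem with α = 0.5 the converged step reproduces the Crank-Nicolson value u(1 − k·Δt/2)/(1 + k·Δt/2) with k = 2; the fixed-point iteration converges because α·k·Δt is smaller than one.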



5 Numerical Results 

In this case, we choose an example with electric properties that vary in space.
Around the point (y0, z0) there is a spot of electric singularity, and the ε(y, z)
value is modified by a mathematical model of a moisture spot.

Figure 2 presents the active zone at time t = 1.3, and Figure 3 presents the
non-dimensional electromagnetic energy at the same time. The electric singularity
profile is shown as a white circle. If the time is greater than L/c, the
electromagnetic field extends beyond the end of the wave guide. Figure 3 shows the
electromagnetic wave emerging into free space. We notice a large asymmetry because
of the electric particularity of the medium. The celerity of the electromagnetic
wave decreases strongly in this zone, and a larger part of the wave is reflected
toward the entrance of the wave guide (Fig. 3). This causes a variation in the
impedance of the wave guide, in particular as seen at the entrance.

Figures 4 and 5 show respectively the sets of elements dedicated to processor #1
and processor #2. There is a graphical resemblance to the domain
decomposition method because the numbering of the elements by the mesh generator
is sequential. Table 1 presents the numerical characteristics of our parallel
computation. These results concern the full grid, but the AMS expert system can
reduce the number of equations by active grid optimization (Fig. 2).

The FAFEMO code combined with the AMS capabilities requires very little memory.
Table 2 presents the memory used.
Fig. 2. Computational area at time t = 1.3.




Table 1. Summary of the computation properties.

Equations    CPU time      Nb. of processors    Speed up (%)
29200        29 mn 31 s    1                    -
29200        16 mn 42 s    2                    88 %
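
The tabulated percentage can be related to the CPU times with a short calculation. Reading the 88 % figure as parallel efficiency (speedup divided by the number of processors) is an interpretation made here, not something stated explicitly in the table.

```python
def to_seconds(minutes, seconds):
    """Convert a 'mn s' CPU time to seconds."""
    return 60 * minutes + seconds


t_one = to_seconds(29, 31)     # CPU time with 1 processor
t_two = to_seconds(16, 42)     # CPU time with 2 processors
speedup = t_one / t_two        # ratio of the two CPU times
efficiency = speedup / 2       # speedup per processor
percent = round(100 * efficiency)
```

Under this reading, the two-processor run is about 1.77 times faster than the single-processor run, which corresponds to the tabulated value when expressed per processor.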
Fig. 5. Part of elements dedicated to processor #2.



Table 2. Summary of used memory.

Processor    Source Code    Memory Full Grid    Memory partial Grid
#1           29 Kb          3.8 Mb              1.8 Mb
#2           -              3.5 Mb              1.6 Mb
6 Conclusion

In general, wave propagation with a front is a difficult numerical problem
with a large CPU cost. We present a general Finite Element formulation for
Maxwell’s equations in the case of a propagation phenomenon. An electromagnetic
wave propagates along the wave guide with the aim of determining the variable
electric properties of the space crossed. This process is performed by the AMS with
a time-dependent number of unknowns. With these techniques we associate a parallel
computing technology based on the selected data with SIMD and MPI technologies.
The first computational tests were performed on an ordinary dual-processor 1 GHz
Pentium PC. In this way the CPU and memory costs are reasonable. This paper shows
that the Finite Element Method, Object-Oriented Programming and data selection by
the AMS expert system constitute a coherent set of techniques that are easy to use
in engineering problems.
Abstract. We develop a coherent set of techniques for parallel computing. We
use the Finite Element Method together with C++ Object-Oriented
Programming and a single database. A data selection technique called
AMS (Automatic Multigrid System) determines the data dedicated to each
processor. The method runs on a SIMD technology combined with the MPI
capabilities. This parallel computing is applied to problems with a very large
CPU cost, particularly unsteady problems or steady problems solved by
iterative methods. Several results in Computational Fluid Dynamics are
presented.



1 Introduction 

In this paper we present a parallel computing method for engineering problems based
on a Finite Element approach. Several methods are used for parallel computing, such
as domain decomposition [1]. We propose a coherent set of techniques that is easy to
implement, including:

Finite Element Method,

C++ Object-Oriented Programming,

Data selection technique,

Matrix free technique and iterative method.

We develop an easy method for parallel computing which seems to be a natural
way to perform intensive computation. Our purpose is to carry out parallel algorithms
without modifying the object structure of the solvers or the data structure. To meet
this requirement, we use a data selection method that yields suitable load balancing
through the choice of lists of elements. This technique is independent of the
geometry and can be applied in general cases. This new concept is a natural way
toward the standardization of parallel codes. Parallelization is applied here to the
solution of the nonlinear system by a matrix free algorithm. The domain of potential
applications is very wide and several examples are presented [2].
Among the different hardware concepts, the SIMD (Single Instruction Multiple Data)
architecture has proved to be the most promising for parallel computers. This
technology is used for high performance computing, especially for problems such as
solving large sets of differential equations [3]. A SIMD parallel computer consists
of a set of processors connected by a fast communication network. Each processor runs
the same program on different data. In our work the data come from a single file and
each processor selects the data that concern it. For the parallel programming we use
the MPI (Message Passing Interface) library.



2 Structure of Code 

Figure 1 shows the general structure of the compact code. It is organized into three
classes corresponding to the functional blocks of the different stages of the Finite
Element Method. With these classes we build three objects that are connected by
single inheritance, so the parameters are transmitted between the objects by a list
technique.

We use efficient C++ Object-Oriented Programming for the Finite Element code
called FAFEMO (Fast Adaptive Finite Element Modular Object) [4]. This technology
allows very small solvers to be implemented. In our examples their size is about
31 Kb (900 C++ lines). Each solver is dedicated to one problem and can be
considered as an element of an algebraic structure [5].



[Figure 1: panels "Finite Element Analysis" and "Structure of the solver"]

Fig. 1. Object structure of a standard solver.
3 Method of Parallel Computing



3.1 Principle of Parallelization 



The main CPU cost comes from computing the elementary matrices and, secondarily,
from updating the time step. For an unsteady problem, the analytical discretization
of the problem by the Finite Element Method is given by the following scalar
product [6]:

$$\sum_{NE} \{\delta u_e\} \cdot \left( [m_e] \left\{ \frac{d u_e}{dt} \right\} + [k_e]\{u_e\} - \{f_e\} \right) = 0 \quad \text{with } NE = 1 \ldots ne \qquad (1)$$

Generally the matrix free technique is used and we consider only the elementary
residuum $\{\psi_e\}$:

$$\sum_{NE} \{\delta u_e\} \cdot \left( [m_e] \left\{ \frac{d u_e}{dt} \right\} - \{\psi_e\} \right) = 0 \qquad (2)$$



If p is the number of processors, we select lists of elements $N_k$ such that:

$$\bigcup_{k=1}^{p} N_k = NE \quad \text{and} \quad N_i \cap N_j = \emptyset \ \text{ for } i \neq j \qquad (3)$$
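
Condition (3) only asks for disjoint lists that together cover all elements. As a minimal sketch, assuming contiguous blocks of nearly equal size (one possible choice among many, with hypothetical names), the lists could be built like this:

```python
def split_elements(n_elements, n_proc):
    """Split element numbers 0..n_elements-1 into n_proc disjoint lists.

    Contiguous blocks are an assumption; any disjoint covering satisfies (3).
    """
    base, extra = divmod(n_elements, n_proc)
    lists, start = [], 0
    for k in range(n_proc):
        size = base + (1 if k < extra else 0)  # sizes differ by at most one
        lists.append(list(range(start, start + size)))
        start += size
    return lists


parts = split_elements(10, 3)
covered = sorted(e for part in parts for e in part)
assert covered == list(range(10))      # the union is NE, with no duplicates
print([len(part) for part in parts])   # prints [4, 3, 3]
```

Lists of similar length are what gives the load balancing discussed in the text, since every element costs roughly the same amount of work.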



Each elementary matrix can be assembled into a global matrix by a classical Finite
Element process [5]. We obtain:

$$\sum_{k=1}^{p} [M_k] = [M] \ \text{(global mass matrix)} \quad \text{and} \quad \sum_{k=1}^{p} \{\Psi_k\} = \{\Psi\} \ \text{(global residuum)} \qquad (4)$$

In this case Bernstein's conditions are satisfied [4], and the load is correctly
balanced if the lists of elements have similar sizes for all processors. The
processors communicate only at the end of the time step. Each processor builds its
part of the differential system, and the algorithm below updates the solution {U}.
A semi-implicit algorithm is used [6]:

$t_0 = 0$

while ($t_n < t_{\max}$)

    for j = 1 to p: $\{\Delta U_j^{i}\} = \Delta t_n \, [M]^{-1} \{\Psi_j(U_n + \alpha \, \Delta U^{i-1},\ t_n + \alpha \, \Delta t_n)\}$

    i = 1, 2, ... until $\|\Delta U^{i} - \Delta U^{i-1}\| \le$ tolerance

    $t_{n+1} = t_n + \Delta t_n$

end while

where $\alpha$ is the upward time-parameter. If $\alpha < 0.5$ a stability correction
is required [6]. For the above algorithm, a technique is available [7] for an easy
diagonalization of the matrix [M].
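
The time loop above can be illustrated with a short Python sketch. It is not the FAFEMO implementation: the mass matrix is assumed to be already diagonal, the per-processor residuals are plain functions called one after another instead of MPI processes, and all names (`semi_implicit_step`, `residual_parts`, `tol`, `max_iter`) are hypothetical.

```python
def semi_implicit_step(u, t, dt, mass_diag, residual_parts, alpha=0.5,
                       tol=1e-12, max_iter=200):
    """One time step of the fixed point dU = dt * M^-1 * Psi(U + alpha*dU).

    mass_diag      : diagonal of the (diagonalized) mass matrix [M]
    residual_parts : one function per processor, each returning the residuum
                     assembled from that processor's list of elements
    """
    n = len(u)
    du = [0.0] * n
    for _ in range(max_iter):
        u_mid = [u[i] + alpha * du[i] for i in range(n)]
        psi = [0.0] * n
        for part in residual_parts:          # concurrent in a real parallel run
            contribution = part(u_mid, t + alpha * dt)
            psi = [psi[i] + contribution[i] for i in range(n)]
        new_du = [dt * psi[i] / mass_diag[i] for i in range(n)]
        change = max(abs(new_du[i] - du[i]) for i in range(n))
        du = new_du
        if change < tol:
            break
    return [u[i] + du[i] for i in range(n)]


# du/dt = -u, with the residuum split into two equal "processor" parts
halves = [lambda u, t: [-0.5 * x for x in u]] * 2
u_next = semi_implicit_step([1.0], 0.0, 0.1, [1.0], halves)
# u_next[0] is close to 0.95 / 1.05, the value of the centred scheme
```

With alpha = 0.5 the converged fixed point coincides with the centred (Crank-Nicolson type) update, which makes it a convenient check for the example.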
3.2 Technique of Parallelization

Each solver is endowed with a capability called AMS (Automatic Multigrid System).
It is an expert system with several possibilities. Its range of applications is
very wide:

Multiprocessor computing (in this paper),

Wave front,

Multidomain computation,

Moving boundary,

Multigrid simulation, ...

Depending on the problem, the AMS expert system can choose different analytical
or geometrical components (Fig. 2):

if the element i is deactivated or activated (active_element[i] = false/true):

all nodes of this element are deactivated or activated:
(active_node[j] = false/true, j = 1 .. nn)

all degrees of freedom of each node j are deactivated or activated:
(active_dof[k] = false/true, k = 1 .. nd)

The converse is true.
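
The propagation rule can be sketched as follows. The flag names follow the text; the function name, the connectivity layout and the treatment of shared nodes are assumptions made for the example only. In particular, a node shared by several elements is assumed to stay active as long as one of them is active.

```python
def propagate_activation(active_element, element_nodes, n_nodes, dofs_per_node):
    """Derive node and degree-of-freedom flags from element flags.

    Assumption: a node shared by several elements stays active as long as at
    least one of those elements is active.
    """
    active_node = [False] * n_nodes
    for elem, is_active in enumerate(active_element):
        if is_active:
            for node in element_nodes[elem]:
                active_node[node] = True
    # every degree of freedom follows the node it belongs to
    active_dof = [active_node[k // dofs_per_node]
                  for k in range(n_nodes * dofs_per_node)]
    return active_node, active_dof


# two triangles sharing the edge (1, 2); only the first one is active
nodes, dofs = propagate_activation([True, False], [(0, 1, 2), (1, 2, 3)], 4, 2)
print(nodes)  # prints [True, True, True, False]
```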

In the case of parallel computing, the AMS expert system chooses the elements
dedicated to each processor in order to share the scalar product (1).




Fig. 2. Taxonomy of the Finite Element parameters.

The principal stages of parallelization can be summarized as follows:

In the first stage, an expert system is created to select the data for each
processor.

In the second stage, each processor computes its own elementary matrices, without
any communication during the step.

In the third stage, each secondary processor sends its assembled elementary
matrices to the principal processor, which can then update the step.
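
The three stages can be mimicked in a few lines of Python. This is a sketch under stated assumptions, not the MPI code of the paper: the "processors" are simulated by a loop, the element is a hypothetical 2-node one-dimensional diffusion element, and the data selection is a simple round-robin list.

```python
def element_residual(elem, u):
    """Residuum of one 2-node element (hypothetical 1D diffusion element)."""
    i, j = elem, elem + 1
    flux = u[j] - u[i]
    return {i: flux, j: -flux}


def assemble(elements, u):
    """Stage 2: a processor assembles the residuum of its own elements only."""
    psi = [0.0] * len(u)
    for elem in elements:
        for node, value in element_residual(elem, u).items():
            psi[node] += value
    return psi


def parallel_residual(u, n_proc):
    n_elem = len(u) - 1
    # Stage 1: data selection, here one round-robin list per processor
    lists = [list(range(rank, n_elem, n_proc)) for rank in range(n_proc)]
    partial = [assemble(lst, u) for lst in lists]   # stage 2, no communication
    # Stage 3: the principal processor sums the partial vectors, as in (4)
    return [sum(values) for values in zip(*partial)]


u = [0.0, 1.0, 4.0, 9.0, 16.0]
assert parallel_residual(u, 2) == assemble(range(4), u)
```

The final assertion checks relation (4) on this small case: the sum of the partial residuals equals the residuum assembled sequentially.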
4 Application

The classical test problem is the flow of a dilatable fluid in a square cavity [1].
We use the AMS capabilities for multiprocessor computing on a two- or four-processor
computer with the preceding algorithm. We also present a driven cavity flow and a
thermo-convection flow [8]. With the usual notations [1], the dimensionless
Navier-Stokes equations can be written:



du: 

-^=0 

dXj 

du- 1 dUf 



dT 1 dT 

1 .U 

dt Pr dx: 





+ Ra.T.S.^ 



1 d^T 
Pr dx^j 



(5) 



We use a classical Galerkin formulation associated with the Taylor-Hood element [9].
For a steady problem the Navier-Stokes equations (5) become a nonlinear system:

$$[K(U)]\{u\} = \{f\} \quad \text{with} \quad \{\Psi\} = \{f\} - [K]\{u\} \qquad (6)$$

An iterative method is used, and the above nonlinear system is solved by the
following iterative algorithm, using the relations (4):

for j = 1 to p: $\{\Delta U_j^{i}\} = [A]^{-1} \{\Psi_j(U^{i})\}$

$\{U^{i+1}\} = \{U^{i}\} + \{\Delta U_j^{i}\}$

i = 1, 2, ... until $\max_j |\Delta U_j^{i}| <$ tolerance

where $[A]^{-1}$ is a diagonal preconditioner [10]. This algorithm is thus similar to
that of an unsteady problem. It is matrix free and can be dispatched to each
dedicated processor. No communication is therefore required between the processors
during an iteration: each of them performs a completely independent computation for
each iteration. This is particularly well adapted to the object structure of the
solver. The SIMD architecture is used for the management of the parallel computing,
and the AMS capabilities select the data for each processor. The corresponding
software is developed with the MPI C++ library. Note that the parallel solver is
almost the same as the one used in a sequential run. These examples were run on a
Silicon Power Challenge computer.
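
For a linear system with a constant matrix, the iteration above reduces to a Jacobi-type method when the diagonal of [K] is taken as the preconditioner [A]. The sketch below makes exactly these simplifying assumptions, which are not those of the paper (there [K] depends on U and the work is shared by lists of elements, not by rows); names such as `solve_steady` and `row_lists` are illustrative.

```python
def solve_steady(K, f, row_lists, tol=1e-10, max_iter=500):
    """Iterate U <- U + A^-1 * Psi(U), Psi = f - K U, with A = diag(K).

    row_lists stands in for the per-processor data selection: each list of
    rows is handled independently and contributes its part of Psi.
    """
    n = len(f)
    u = [0.0] * n
    for _ in range(max_iter):
        psi = [0.0] * n
        for rows in row_lists:                       # one pass per processor
            for i in rows:
                psi[i] = f[i] - sum(K[i][j] * u[j] for j in range(n))
        du = [psi[i] / K[i][i] for i in range(n)]    # diagonal preconditioner
        u = [u[i] + du[i] for i in range(n)]
        if max(abs(d) for d in du) < tol:
            break
    return u


K = [[4.0, 1.0], [1.0, 3.0]]
f = [1.0, 2.0]
u = solve_steady(K, f, row_lists=[[0], [1]])
# u is close to [1/11, 7/11], the exact solution of K u = f
```

Every residual entry is computed from the previous iterate only, so the passes over the different lists are independent and could run on different processors.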

The different stages of the parallel computing are applied to CFD problems.
Different cases of driven square cavities with thermoconvection are presented in
Figures 3, 4 and 5. The structured mesh is 20 x 20, i.e. 800 triangles, and the
velocity vectors are plotted. In this case we use 2 or 4 processors, and the list of
elements is split into 2 or 4 equal parts. Initially the vortices are separate, and
each picture is an algebraic representation of a different stage of the previous
iterative process. As the number of iterations increases, the separate vortices
merge and give the exact solution [1]. The picture resembles a domain decomposition
because the elements are numbered sequentially along the coordinate axes. These
results are similar to those of reference [1].






Fig. 3. Velocity vectors in a driven cavity calculated with 2 processors. 



Table 1. Summary of the computation properties.

Nb. of processors    Equations    CPU time (10 iterations)    Speed up
1                    2554         2 mn 43 s                   -
2                    2554         1 mn 33 s                   90 %
4                    2554         0 mn 47 s                   87 %

Table 1 summarizes the efficiency of the parallel computing.



5 Conclusion

An easy method of parallel computing for engineering problems is proposed. It
uses a coherent set of techniques including:

- Finite Element Method,

- C++ Object-Oriented Programming by FAFEMO software,

- Data selection technique by AMS expert system,

- Matrix free algorithms.

In this context the implementation of the small solvers concerned is very easy.
The SIMD architecture associated with the MPI C++ library is used. We thus have an
efficient method for the parallelization of differential systems arising from the
Finite Element Method. The performance is satisfactory: we note in particular the
low memory use and the good load balancing. Different examples in Computational
Fluid Dynamics are presented.

Fig. 4. Velocity vectors of a thermoconvection problem calculated with 2 processors.

Fig. 5. Velocity vectors in a driven cavity calculated with 4 processors.
Abstract. This paper suggests a correction of Bird's algorithm in the
DSMC method. It takes into account the real distribution of collision
events inside the time step $\Delta t$ and the actual trajectories of the
collided particles, thus reducing the asymptotic order of the error in time
evolution from $O(\Delta t)$ to $O((\Delta t)^2)$. However, the structure of
the algorithm becomes more complicated and its parallel implementation
becomes a new problem. As one solution of this problem, the corrected DSMC
method in its domain-decomposed version was applied to the simulation of
unsteady flow in a two-dimensional cavity with a moving bottom. The
numerical results of this simulation presented in the paper show a
noticeable artificial acceleration of the changes of system parameters by
the uncorrected version in comparison with the corrected one, as the former
locates all collision events from the previous collisional time step at one
time point at the beginning of the space motional step. The difference
between their results for the time development of the mean velocity
circulation along identical loops increases proportionally to the value of
the time step $\Delta t$ used and to the mean number of molecular collisions
over the distance from the source of perturbation to a measuring point.



1 Introduction 

Monte Carlo simulation is widely used in rarefied gas dynamics for the solution of
very different problems. The most successful version proved to be the one introduced
by G.A. Bird [1], [2] and known as the direct simulation Monte Carlo (DSMC)
method. In this version molecular collisions and the space motion of the molecules
are split, within a small time step $\Delta t$, into two separate processes computed
one after the other. After a collision time step, the collided molecules are
therefore moved to their new positions with their newly acquired velocities from the
very beginning of the subsequent space motional time step. Thus all collision events
are located at one time point at the beginning of the step instead of being
distributed within it as in the real evolution, and the error of the DSMC method is
of order $O(\Delta t)$. In order to reduce this error a trajectory correction was
introduced [3], [4]. It is briefly described in the next section along with its
further development and some consequences for the parallel implementation of the
corrected DSMC method.
2 Background and Trajectory Correction

As underlined in papers [3], [4], an important point of the DSMC method
is the assumption that molecules are located with equal probability anywhere inside
their cells. This is related to the nature of gasdynamical measurements, where the
particle density inside small space volumes with linear size, say, $R_1$ cannot be
determined during the measurements. For any state of the gas system such an
assumption therefore does not contradict long-term scientific experience.

It should also be mentioned that the time of gasdynamical measurements
is limited from below, and the characteristic times of all the processes must be
much larger than some time interval $\Delta t_1$, commonly $\Delta t_1 \approx R_1/c$,
where c is a characteristic speed. This is true, for instance, for local optical
Doppler shift measurements of the molecular velocity in a cell. Usually all
parameter changes during this time step $\Delta t_1$ are so small that the system can
be considered as quasi-stationary within it. By choosing the DSMC decoupling time
step $\Delta t$ to be of the order of $\Delta t_1$ we can take advantage of some
results of Khinchine [5] for stationary random processes.

Indeed, random collisional fluxes were introduced in [3], [4]: the probabilities
of an encounter for a pair of molecules in a cell during the time between t and
t + dt within the collisional time interval $\Delta t$. The superposition of all
such fluxes in a cell forms a stochastic process. Asymptotic estimates showed [6]
that the constituent random fluxes are mutually independent, stationary, ordinary
and with limited after-effect, so the applicability conditions of Khinchine's
limit theorem [5] are satisfied. The superposition of these random fluxes is then
a Poisson process, and the number of collisions in a cell is a Poisson variate.

Because the molecular space motion and the mutual molecular collisions are
decoupled, a collided particle in Bird's algorithm moves straight from the
very beginning of the subsequent space motional time interval $\Delta t$ with its
newly obtained velocity $c'_j$, instead of running over at least two asymptotes of a
real trajectory, first with the old velocity $c_j$ and then with the new one $c'_j$
after an apex at the collision time point $t_c$ somewhere inside $\Delta t$. Because
of the stationarity mentioned above, the probability of the apex location does not
depend on time, hence the probability of an encounter between t and t + dt within
$\Delta t$ is equal to $dt/\Delta t$. The time point of an encounter $t_c$ can
therefore be simulated simply by $t_c = \mathrm{rnd}() \cdot \Delta t$, rnd() being
the next random number. The complete displacement $s_j$ which includes such a
collision is given by

$$s_j = c_j \, t_c + c'_j \, (\Delta t - t_c)$$
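
A minimal Python sketch of this displacement, next to the uncorrected one for comparison. The function names and the tuple representation of velocities are assumptions of the example; only the sampling rule t_c = rnd() * Δt and the two-asymptote displacement come from the text.

```python
import random


def corrected_displacement(c_old, c_new, dt, rng=random):
    """Displacement of a collided particle over one step of length dt.

    The collision instant t_c is uniform on [0, dt); the particle moves with
    c_old before t_c and with c_new afterwards. Velocities are tuples.
    """
    t_c = rng.random() * dt
    return tuple(vo * t_c + vn * (dt - t_c) for vo, vn in zip(c_old, c_new))


def uncorrected_displacement(c_new, dt):
    """Bird-type splitting: the new velocity acts during the whole step."""
    return tuple(vn * dt for vn in c_new)


s = corrected_displacement((1.0, 0.0), (0.0, 1.0), 0.04)
print(s, uncorrected_displacement((0.0, 1.0), 0.04))
```

Averaging the corrected displacement over t_c gives (c_old + c_new) * Δt / 2, whereas the uncorrected version always uses c_new * Δt.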

As one needs here the values of both velocities, $c_j$ before and $c'_j$ after an
encounter, it is natural to calculate the space motion of collided particles within
the collisional time step. The probability of two collisions during $\Delta t$ for
the same molecule, being proportional to $O((\sigma n v \Delta t)^2)$, where $\sigma$
is the molecular collision cross section, n the concentration and v the relative
speed, is considered to be vanishingly small and is neglected. The error in time is
correspondingly proportional to $O(\Delta t^2)$. Though expressions for
nonstationary corrections and for collisions of molecules from different cells have
already been obtained in [3], for simplicity they will not be applied here.

This is the main difference between the present correction and the traditional
Bird's algorithm [1], [2]. In papers [3], [4], in order to reduce the computational
cost, a simplified version of the trajectory correction was used, with the
average displacement $E(s_j)$ instead of a complete simulation. This perhaps
excessive simplification made it possible at that time to maintain the structure of
Bird's algorithm. A true complete simulation, however, requires a change of that
structure, and the parallel implementation of this corrected DSMC method then
presents a new problem. The next section describes a parallel, domain-decomposed
implementation of the corrected DSMC method for the flow in a two-dimensional
cavity. The last section contains conclusions.



3 An Unsteady Flow Simulation 

In order to estimate the practical significance of this correction, a two-dimensional
unsteady flow in a square cavity with a moving bottom and diffusely reflecting walls
was simulated by the DSMC method both with and without it. The simulation domain is
shown in Fig. 1a), where the linear size of the square is equal to 32 mean free
paths (mfp) of the initial state. During the simulations the domain was divided into
9216 square cells with a cell size equal to one third of the mean free path, and the
constant velocity of the moving bottom $U_w$ was equal to 0.6 of the most probable
velocity of the initial state $v_t$, $v_t = \sqrt{2kT/m}$, where k is Boltzmann's
constant, T the initial temperature and m the molecular mass. The time step
$\Delta t$ was equal to 0.04, 0.08 or 0.16 of the mean free time (mft) of a molecule.
The molecules were assumed to interact as hard spheres.





Fig. 1. a) Simulation domain, b) Relative error S for the uncorrected version in
comparison with the corrected one. The solid curve is for y = 1.9 mfp and the dashed
one for y = 0.4 mfp.
The whole flow field was decomposed equally between four processors. After the
simulation of the initial state was completed, the number of potential collisions
was first determined for each cell; it is the expectation of the variate and is
given by the formula of Bird's NTC method [2]. Then, for those of them which
experienced a real collision according to that method, it was checked whether
the first asymptote of the trajectory, travelled with the old velocity, had
intersected the boundary walls. If so, the collision was still considered false;
otherwise it was a real encounter. Next it was checked whether the second asymptote
of the trajectory had intersected the border lines of the processor or the boundary
walls. In the latter case the appropriate reflections from the walls were simulated,
while in the former the data of the molecule were recorded and afterwards
transferred to the pertinent processor through the 'recv' and 'send' operations of
MPI. The results of the reflections were again checked for intersections, and so on.
Thus the complete trajectories of the collided particles were constructed. The
probability of two collisions for the same molecule was neglected, as mentioned
above, and so was the probability of a second collision, after that time point, for
molecules which had a false encounter because of wall perturbation. After all the
collided molecules have been moved along their real trajectories, the particles
remaining in the cell are moved freely according to their velocities. Once this
procedure is completed for every cell in the flow field, one only has to do the
indexing by determining the new molecular numbers in the cells. Thus the traditional
splitting procedure is entirely avoided.
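
The checks applied to one candidate collision can be sketched in one space dimension. This is an illustration only, with several simplifying assumptions that are not in the paper: a single coordinate, one wall on each side, one processor border, specular instead of diffuse reflection, and no repeated checks after a reflection. All names are hypothetical.

```python
def classify_collision(x, c_old, c_new, t_c, dt, wall_lo, wall_hi, border):
    """Classify one candidate collision for a particle at position x (1D sketch).

    The processor is assumed to own [wall_lo, border); the neighbouring
    processor owns [border, wall_hi]. Returns a label and the position reached.
    """
    x_apex = x + c_old * t_c                    # first asymptote, old velocity
    if x_apex <= wall_lo or x_apex >= wall_hi:
        return "false collision", None          # the wall is hit before t_c
    x_end = x_apex + c_new * (dt - t_c)         # second asymptote, new velocity
    if x_end <= wall_lo:
        return "wall reflection", 2 * wall_lo - x_end   # specular, for brevity
    if x_end >= border:
        return "transfer to neighbour", x_end   # data would be sent via MPI
    return "local", x_end


print(classify_collision(0.1, -1.0, 1.0, 0.2, 0.5, 0.0, 2.0, 1.0)[0])
# prints false collision
```

In the actual method the "transfer" case corresponds to writing the molecule data and passing them to the pertinent processor, and the "wall reflection" case is followed by further intersection checks.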

The results of the simulations have shown that approximately 6-8% of collisions
are affected by walls or borders. For instance, for a cell near a wall and
far from the processor borders, in a test simulation 6.6% of all collisions were
not realized because their trajectories were perturbed by an encounter with the wall
before the collision time point $t_c$. For a cell near a processor border, 7.8% of
collisions happened to occur in the neighboring processor's space; their
trajectories were nevertheless followed up to the very end, and their data were then
transferred to the pertinent processor. Finally, for a cell located exactly at the
corner between a wall and a border, 7.7% of collisions were not realized and 7%
occurred in a different processor's domain. Turning now to gas dynamical
quantities, it is useful to consider the circulation Q(y) over rectangular loops
C(y) of the cell-averaged molecular velocity V, related to the value of $U_w$ and
divided by the loop length L,

$$Q(y) = \frac{1}{U_w L} \oint_{C(y)} \mathbf{V} \cdot d\mathbf{l},$$

with different distances y from the moving wall to the nearest part of the contour,
see Fig. 1a).

The circulation Q(y) is depicted in parts a) and b) of Fig. 2, corresponding
to $\Delta t$ = 0.04 mft and $\Delta t$ = 0.16 mft respectively. The dashed and solid
curves represent the uncorrected and corrected versions for different distances y
from the moving surface: curve 1 for y = 0.4 mfp, curve 2 for y = 1.9 mfp and
curve 3 for y = 4.9 mfp. As one can see from Fig. 2, the uncorrected version shows
higher Q(y) values. This is easily understood by taking into consideration that the
perturbations from the moving wall are transferred through the collisions of the
particles. In this version the molecules start from the very beginning of the space
motional time step $\Delta t$ with the new velocities obtained after a collision,
which in the real evolution is distributed over the whole interval, and thus
transport perturbations ahead of the real evolution.





Fig. 2. Circulation Q(y): a) for $\Delta t$ = 0.04 mft and b) for $\Delta t$ = 0.16 mft;
curves 1 for y = 0.4 mfp, curves 2 for y = 1.9 mfp and curves 3 for y = 4.9 mfp.
Dashed and solid curves are the uncorrected and corrected versions respectively.



Some exceptions in Fig. 2a) for curves 3 could be explained by enhanced scattering
due to the insufficiency of both the sample size and the time interval of the
simulation, in a case where most of the contour cannot be reached during that
interval. The difference between the two versions increases appreciably as the time
step $\Delta t$ grows. This is seen by comparing parts a) and b) of Fig. 2, which
present the same quantities with $\Delta t$ four times larger in the latter case
than in the former.

Part b) of Fig. 1 depicts the relative error $S = (Q_{un}/Q_{c} - 1) \cdot 100\%$.
The lower curve corresponds to y = 0.4 mfp and the much higher values belong to
y = 1.9 mfp, thus confirming the importance of the distance for the accumulation of
the overall error.

4 Conclusions 

The results presented in the previous sections show that the trajectory correction
recovers the joint development inside $\Delta t$ of the processes decoupled in DSMC
by properly mapping the collision events onto the subsequent space motional time
step. The uncorrected version, on the contrary, accelerates the whole evolution by
locating all collisions at one time point at the beginning of the space motional
step instead of distributing them within it. As a natural consequence, the error in
time of Bird's algorithm is asymptotically of order $O(\Delta t)$. The
trajectory-corrected version introduced in the paper has this order equal to
$O(\Delta t^2)$. The value of the time step $\Delta t$ used and the mean number of
molecular collisions over the distance from the source of perturbation to a
measuring point are important parameters of the DSMC method related to its accuracy.
Abstract. A new approach to the design of parallel algorithms for
modelling multi-scale non-stationary processes is proposed. Our technique
is based on explicit multi-level difference schemes with local stability
conditions. We study a number of methods which can be implemented
efficiently on multi-computer systems and apply them to some problems
from combustion theory.



The advantages of explicit methods from the parallelization point of view are
well known. But very stiff stability conditions have been the reason to exclude such
algorithms from computational practice, especially for diffusion problems.
On the other hand, there are many examples where implicit schemes fail. For
instance, implicit methods for flame propagation problems require a discretization
time step which coincides with the step of the stable explicit scheme. The reason
for this effect is a local instability in the small subdomain where combustion takes
place. It means that a large time step in the implicit scheme does not provide
acceptable accuracy.

Thus, we consider a new class of explicit schemes with different time steps in
space subdomains and with local stability conditions. Below we present some methods
in vector-matrix form. All these algorithms may be considered as domain
decomposition methods. Our consideration is partially based on the ideas described
in [1-4]. Among the variety of existing methods of parallelization, the algorithms
considered here are parallelized by the domain decomposition method and by virtue of
the explicit form of the schemes. Briefly, the essence of this method is the
following. The basic data of a problem are distributed among the nodes (branches of
a parallel algorithm), and the algorithm is the same in all the nodes, but the
operations of this algorithm are distributed according to the data available in
these nodes. The distribution of the operations of an algorithm consists, for
example, in assigning different values to a variable of the same loop in different
branches, or in executing a different number of iterations of the same loops in
different branches, etc. A homogeneous distribution of data among the nodes
(branches) serves as a basis for the balance between the time needed for
calculation and the time needed for the interaction of branches.

* The work was supported by the RFBR (grant 01-01-00819), the Programme "Russian
Universities" (grant 991116), the Russian-Holland Programme NWO-RFBR (grant

Let all unknown variables be divided into two groups. Then the matrix corresponding
to the diffusion grid operator is presented in block form

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{12}^{T} & A_{22} \end{pmatrix}.$$

The two-level scheme of Dirichlet type is the following:
where $\Delta t = m\tau$, $n = 0, 1, \ldots$, $k = 0, \ldots, m-1$. Here the vector
$u_1^{n}$ corresponds to the variables in the "external subdomain", $u_2^{n+k/m}$
corresponds to the variables in the "internal subdomain", and equalities (2) are the
linear interpolation on the interface. It means that in the "internal subdomain" we
solve the Dirichlet problem. The scale difference is provided by the strong
inequality

$$\|A_{11}\| \ll \|A_{22}\|.$$

The main theoretical result is the localization of the stability conditions:

$$\Delta t \, \|A_{11}\| = O(1), \qquad \tau \, \|A_{22}\| = O(1).$$
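
The effect of local stability conditions can be seen on a toy problem with two scalar unknowns. The sketch below is not the scheme of the paper, whose formulas are not reproduced here. It assumes explicit Euler steps in each group, a coarse step Δt for the first unknown, m fine steps τ = Δt/m for the second, and linear interpolation in time of the first unknown during the fine steps. All names and coefficient values are illustrative.

```python
def two_level_step(u1, u2, dt, m, a11, a12, a21, a22):
    """One coarse step dt for u1 and m fine steps tau = dt/m for u2.

    Scalar blocks and zero right-hand side are used for brevity. The form of
    the scheme is an assumption: explicit Euler in each group, with u1
    interpolated linearly in time for the fine steps.
    """
    tau = dt / m
    u1_new = u1 - dt * (a11 * u1 + a12 * u2)           # external subdomain
    for k in range(m):
        u1_k = u1 + (k / m) * (u1_new - u1)            # interface interpolation
        u2 = u2 - tau * (a21 * u1_k + a22 * u2)        # internal subdomain
    return u1_new, u2


def single_level_step(u1, u2, dt, a11, a12, a21, a22):
    """Ordinary explicit Euler with the same step dt in both groups."""
    return (u1 - dt * (a11 * u1 + a12 * u2),
            u2 - dt * (a21 * u1 + a22 * u2))


u1, u2 = 1.0, 1.0
v1, v2 = 1.0, 1.0
for _ in range(20):
    u1, u2 = two_level_step(u1, u2, 0.5, 100, 1.0, -0.5, -0.5, 100.0)
    v1, v2 = single_level_step(v1, v2, 0.5, 1.0, -0.5, -0.5, 100.0)
print(u1, u2)      # both decay toward zero
print(abs(v2))     # grows without bound
```

Here Δt * a11 = 0.5 and τ * a22 = 0.5, so each group satisfies its own stability bound, while the single-level scheme would need Δt * a22 = 50 to be of order one.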

The accuracy of scheme (1)-(3) is $O(\Delta t)$ in the "external subdomain" and
$O(\tau)$ in the "internal subdomain". Now we present a scheme with accuracy
$O((\Delta t)^2)$ in the "external subdomain" at practically no additional
arithmetical cost. Here we use the result from article [5]. Let wf be auxiliary vectors
and Wj’ = u\. Then we consider the following scheme: 



^ ^ + Aiiv'i + H12+ = /r, 

I h I k — 1 

ttH ttH 

-7 / w -7 / w . k — 1 \ k — 1 \ k — 1 

“2 “2 I aT I A 

h Ai 2 'i^i + A22U2 — f 2 : 

T 



^ + ^ U2+U‘. 



A ~ “1 Ai2 - ^ 



71+1 



(4) 

(5) 

( 6 ) 



At 



2 



2 



^(/r+/r+'), (7) 




444 Yu.M. Laevsky et al. 



where $k = 1, \ldots, m$. A more accurate variant of schemes of this type uses the
corrector in the "internal subdomain" too:

In many combustion problems there are more than two scales. For instance,
the gas filtration problem has at least three essentially different time-space
scales. For such problems we propose the Neumann type multi-level scheme. All
unknown variables are divided into p groups, and the matrix corresponding to the
diffusion grid operator is now presented in block-tridiagonal form. Then let
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where Ai i are the diagonal blocks and A^^\ 
block matrices 
Now we have p different scales and the corresponding condition is

$$\|A_1\| \ll \cdots \ll \|A_p\|.$$



And finally, let us introduce some index notations. Let $m_2, \ldots, m_p$ be some
natural numbers. Then $r_1 = 1$, $r_l = m_l r_{l-1}$ and $\alpha_1 = n$,
$\alpha_l = \alpha_{l-1} + n_l / r_l$, $l = 2, \ldots, p$. The values $\alpha_l$ are
functions of the parameters $n_k = 0, \ldots, m_k$, $k = 2, \ldots, l$. Let us
consider the following multi-level difference scheme:
$n_k = 0, \ldots, m_k - 1$, $k = 2, \ldots, l$, $l = 2, \ldots, p-1$,
where the sequence of time steps is given by the equalities
$\tau_1 = \Delta t$, $\tau_l = \tau_{l-1} / m_l$, $l = 2, \ldots, p$.

It is easy to see that, in contrast to schemes (1)-(3), (4)-(7) and (8)-(12), here
we do not use interpolation on the interface, and in the "internal subdomain" we
solve the Neumann problem. In particular, the two-level scheme has the form
where $k = 0, \ldots, m-1$, $m = m_2$. For scheme (13)-(15) we have proved a
localization of the stability conditions:

$$\tau_l \, \|A_l\| = O(1), \qquad l = 1, \ldots, p.$$

Now we present a Dirichlet-Neumann type algorithm with another adjoint condition on
the interface. Namely, we use an adjoint condition based on the penalty method. This
algorithm approximates some auxiliary problem with a discontinuous solution, and
convergence to the solution of the original problem is provided by a small positive
parameter $\varepsilon$. For simplicity we present the two-level variant of the
method. First let us divide all variables into three groups: the variables in the
open "external subdomain", the variables on the interface and the variables in the
open "internal subdomain". Then the diffusion grid operator has the form
In accordance with these notations we consider two groups of variables. Note that
we include the variables on the interface in both groups. Then the explicit scheme
of the penalty method has the form
corresponds to the internal Newton type boundary conditions in the differential
problem:

    $\varepsilon \frac{\partial u_1}{\partial \nu_1} + u_1 - u_2 = 0$, $\frac{\partial u_1}{\partial \nu_1} + \frac{\partial u_2}{\partial \nu_2} = 0$ on B.

The theory of convergence with respect to the small parameter $\varepsilon$ is based on
the estimate

    $\|u - u_\varepsilon\|_{L_2(Q)} =$

where $u$ and $u_\varepsilon$ are the solutions of the original and the perturbed problems,
respectively.



References 

1 . V.l.Drobyshevich and Yu.M. Laevsky. An algorithm of solution of parabolic equa- 
tions with different time-steps in subdomains. Rus. J. of Numer. Anal, and Math. 
Modell. - 1992. - V.7, No.3. - R205-220. 

2. Yu.M. Laevsky. The decomposition of domains for parabolic problems with discon- 
tinuous solutions and the penalty method. Comp. Maths Math. Phys. - 1994. - 
V.34, N 0 . 5 . - P.605-619. 

3. V.D. Korneev and S. A. Litvinenko. The domain decomposition parallel algorithm 
for multi-dimensional parabolic equations. Bull, of the Novosibirsk Computing 
Center. - Ser. Computing Science. - 1999. - Is. 10. - P.25-35. 

4. Yu.M. Laevsky and P.V.Banushkina. The compound explicit schemes. Siberian J. 
Numer. Math. - 2000. - V.3, No. 2.- P.165-180 (in Russian). 

5. G.V.Demidov and E. A. Novikov. Effective algorithm for the integration of non- 
stiff systems of ordinary differential equations. Numerical Methods in Mathemat- 
ical Physics. - 1979, Novosibirsk. - Computing Center of SB RAS. - P.69-83 (in 
Russian) . 




Tool Environments in CORBA-Based Medical
High Performance Computing

Thomas Ludwig, Markus Lindermeier, Alexandros Stamatakis, and Günther Rackl

Technische Universität München (TUM), Informatik
Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR-TUM)
Arcisstr. 21, D-80333 München
{ludwig, linderme, stamatak, rackl}@in.tum.de



Abstract. High performance computing in medical science has led to
important progress in the field of computer tomography. A fast calculation
of various types of images is a precondition for statistical comparison
of big sets of input data. In our current research we adapted parallel
programs from PVM to CORBA. CORBA makes the integration into
clinical environments much easier. In order to improve efficiency and
maintainability we added load balancing and graphical on-line tools to
our CORBA-based application program.



1 Introduction 

Imaging in medical science is an important issue that shows an increasing con- 
nection with high performance computing. Relevant picture series from imag- 
ing hardware like magnetic resonance tomographs or positron emission tomo- 
graphs are usually computed on powerful servers and stored in specialized picture 
archiving systems. 

Recently, workstation clusters have become increasingly popular as they
provide a good price-performance ratio. Furthermore, many operations that are
performed on these picture series exhibit maximal parallelism. In many cases
no interprocess communication is required and the parallelization is handled at
the granularity level of the individual pictures.

As soon as the parallel imaging servers are used in production mode we
are faced with two more problems. The first is the load of the individual nodes of
the cluster. It should be balanced in order to guarantee an optimal use of the
computational power of the cluster. Second, the imaging software has to interact
with other software components in a medical environment and thus has to meet
certain standards of reliability and interoperability.

The paper will present an approach where we base our parallelization of the
imaging software on a distributed object-oriented middleware system (in our case
CORBA) to take advantage of component integration. A load balancing mechanism
is integrated into a specific CORBA ORB to provide optimal performance
to the application programs.

[Figure panels: Load Management System, Runtime Environment]

Fig. 1. The components of a load management system

2 The Load Management System 

Load management systems can be classified according to their implementation. 
They may be integrated into the application, the runtime system, or a separate 
service. The first case is called application level, the second one system level, 
and the third one service level load management. We decided to make a system 
level implementation because it provides maximum flexibility and transparency 
to the user. 

In general, load management systems can be split into three components:
the load monitoring, the load distribution, and the load evaluation component.
They fulfill different tasks and work at different abstraction levels, which eases
the design and the implementation of the overall system. Figure 1 shows the
components of a load management system and a runtime environment containing
some application objects.

The load monitoring component provides both information on available
computing resources and their utilization and information on application objects
and their resource usage. This information has to be provided dynamically, i.e.
at runtime, in order to obtain knowledge about the runtime environment and its
objects. The computing resources in distributed environments may be shared by
middleware-based applications and legacy applications.

Load distribution provides the functionality for distributing workload. Load 
distribution mechanisms for system level load management are initial placement, 
migration, and replication. 

Initial placement stands for the creation of an object on a host that has
enough computing resources to execute the object efficiently. Initial
placement may be applied to all kinds of objects because it is done at
creation time.

Migration means the movement of an existing object to another host that
promises a more efficient execution. It may be applied to all kinds of objects,
too. However, migration is applied to existing objects, so the object state
has to be considered. The object's communication has to be stopped and its
state has to be transferred to the new object. Finally, all communication has
to be redirected to the new object.

Replication is similar to migration, but the original object is not removed, so
several identical objects called replicas exist. Further requests to the
object are divided up among its replicas in order to distribute workload
(requests) among the replicas. Replication is restricted to replication safe
objects. This means that an object can be replicated without applying a
consistency protocol to the replicas. A precise definition of the term
replication safe can be found in [7].
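
The three mechanisms can be pictured with a small Python sketch. It is only an
illustration under our own assumptions: the names Host, ManagedObject,
initial_placement, migrate, replicate and dispatch are hypothetical and are not
part of the load management system described here, and the rule "pick the least
loaded host" as well as the round-robin dispatch are assumed placeholders.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Host:
    name: str
    load: float = 0.0  # assumed scalar load metric, smaller is better


@dataclass
class ManagedObject:
    state: dict
    host: Host


def least_loaded(hosts):
    return min(hosts, key=lambda h: h.load)


def initial_placement(state, hosts):
    # Create the object on a host with enough free resources.
    return ManagedObject(dict(state), least_loaded(hosts))


def migrate(obj, hosts):
    # Transfer the state to a new object on another host; the old one is dropped.
    target = least_loaded([h for h in hosts if h is not obj.host])
    return ManagedObject(dict(obj.state), target)


def replicate(obj, hosts):
    # Like migration, but the original object is kept as one of the replicas.
    target = least_loaded([h for h in hosts if h is not obj.host])
    return [obj, ManagedObject(dict(obj.state), target)]


def dispatch(requests, replicas):
    # Divide requests among the replicas (round robin is an assumption).
    return [(req, rep.host.name) for req, rep in zip(requests, cycle(replicas))]
```

In this toy model migration and replication differ only in whether the original
object survives, which mirrors the description above.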

Finally, the load evaluation component makes decisions about load distribu- 
tion based on the information provided by load monitoring. The decisions can 
be reached by a variety of strategies. The aim of the diverse strategies is to 
improve the overall performance of the distributed application by compensating 
load imbalance. There are two main reasons for load imbalance in distributed 
systems. First, background load can substantially decrease the performance of 
a distributed application. Second, request overload that is caused by too many 
simultaneously requesting clients increases the request processing time and thus, 
decreases the performance of the overall application. Both sources of load im- 
balance have to be considered by a load management system. 

Distributed object oriented environments like CORBA [10] or DCOM [2] are 
based on some kind of object model. In general, the object models imply some 
transparency requirements [8]. Location transparency demands that the loca- 
tion of an object is unknown to its user. The middleware transparently connects 
client and server. Access transparency postulates that all objects in a distributed 
system are accessed in the same way. The middleware is responsible for providing 
uniform access to all objects, independent of their implementation or runtime 
environment. These transparency requirements have to be fulfilled by load man- 
agement systems, too. Therefore, load distribution has to be transparent to the 
user. Our load management system provides full migration and replication trans- 
parency which means that migration and replication are completely transparent 
to the user. 

The load management concepts described so far are universal and may be 
applied to diverse distributed object-oriented environments. The implementation 
of these concepts strongly depends on the underlying middleware architecture. 
We decided to make an implementation for CORBA because it is the most 
popular middleware architecture. 

In CORBA, objects are connected to the middleware by the POA (Portable
Object Adapter). The object adapter provides the functionality for creating and
destroying objects and for assigning requests to them. The POA is configured
by the developer via so-called policies. The ORB (Object Request Broker)
provides the functionality for creating object adapters and for request handling. A
request to an object arrives at the ORB, which transmits it to the appropriate
POA. Subsequently, the object adapter starts the processing of the request by
an implementation of the object (Servant).

The load management functionality, especially load monitoring and load
distribution, has to be integrated into the ORB and the POA because we decided
to make a system level implementation. Therefore, we added some policies and
interfaces to the POA in order to enable state transfer and the creation of
replicas. The monitoring of the runtime environment is performed via the Simple
Network Management Protocol (SNMP) [11], which is a well established
standard in network management.

A new policy called ControlFlowPolicy that controls the creation and de- 
struction of CORBA objects is added to the POA. The policy value USER indi- 
cates that objects are created by the programmer. The value SYSTEM indicates 
that objects are created on demand by the CORBA runtime environment. This 
enables the transparent creation of new objects in case of migration and repli- 
cation. Therefore, the programmer has to provide a ServantFactory interface 
that enables the creation and destruction of Servants analogous to the Factory 
design pattern [4]. The POA’s RequestProcessingPolicy is extended with the 
value USE_SERVANT_FACTORY that causes the POA to use the ServantFactory 
for object creation and destruction. 

Migration and replication of objects that hold state require state transmis- 
sion as described before. Therefore, some persistence mechanism has to be pro- 
vided. A new policy, the PersistencePolicy is added to the POA. The pol- 
icy value USE_PERSISTENT_SERVANT_FACTORY indicates that an extension of the 
ServantFactory interface, the PersistentServantFactory, is used in order to 
create and destroy objects. Additionally, the PersistentServantFactory pro- 
vides the functionality to extract an object’s state and to recreate objects from 
that state. This approach enables the application of various persistence mecha- 
nisms like the Persistent State Service [9] or proprietary mechanisms like Java 
serialization. 
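
The role of the two factory interfaces can be sketched as follows. This is a
Python analogy and not the actual Java/CORBA interfaces: the method names
create, destroy, extract_state and create_from_state are assumptions, the
Counter servant is a made-up example, and pickle merely stands in for a
proprietary persistence mechanism.

```python
import pickle
from abc import ABC, abstractmethod


class ServantFactory(ABC):
    """Lets the runtime create and destroy servants on demand."""

    @abstractmethod
    def create(self): ...

    @abstractmethod
    def destroy(self, servant): ...


class PersistentServantFactory(ServantFactory):
    """Adds state extraction and re-creation from that state."""

    @abstractmethod
    def extract_state(self, servant) -> bytes: ...

    @abstractmethod
    def create_from_state(self, state: bytes): ...


class Counter:
    # Hypothetical servant holding a tiny amount of state.
    def __init__(self, value=0):
        self.value = value


class CounterFactory(PersistentServantFactory):
    def create(self):
        return Counter()

    def destroy(self, servant):
        pass  # nothing to release in this toy example

    def extract_state(self, servant):
        return pickle.dumps(servant.value)

    def create_from_state(self, state):
        return Counter(pickle.loads(state))


def migrate(factory, servant):
    # Extract the state, drop the old servant, rebuild it elsewhere.
    state = factory.extract_state(servant)
    factory.destroy(servant)
    return factory.create_from_state(state)
```

Because the runtime only talks to the factory, the same migrate routine would
work with any persistence mechanism hidden behind it.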

Finally, request redirection is performed by the CORBA Location Forward
mechanism [5]. It allows a server to hand over new object references to clients
by raising a ForwardRequest exception. The client runtime transparently
reconnects to the forwarded reference. This guarantees migration and replication
transparency.
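
The redirection can be mimicked in a few lines of Python. This is a toy model
under our own assumptions, not ORB code: the classes Server and ClientStub and
the max_hops limit are hypothetical, and only the idea of an exception carrying
the new reference is taken from the text.

```python
class ForwardRequest(Exception):
    """Toy stand-in for the exception that carries the new reference."""

    def __init__(self, forward_reference):
        super().__init__("request forwarded")
        self.forward_reference = forward_reference


class Server:
    def __init__(self, name, forward_to=None):
        self.name = name
        self.forward_to = forward_to  # set once the object has been moved

    def compute(self, x):
        if self.forward_to is not None:
            raise ForwardRequest(self.forward_to)
        return (self.name, x * 2)


class ClientStub:
    """Plays the role of the client runtime: reconnects transparently."""

    def __init__(self, reference, max_hops=8):
        self.reference = reference
        self.max_hops = max_hops  # assumed guard against forwarding loops

    def compute(self, x):
        for _ in range(self.max_hops):
            try:
                return self.reference.compute(x)
            except ForwardRequest as fwd:
                self.reference = fwd.forward_reference
        raise RuntimeError("too many forwards")
```

The caller of ClientStub.compute never sees the exception, which is the sense
in which the redirection is transparent.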

3 The Medical Image-Processing Application 

A medical image-processing application was chosen as a proof of concept.
The realignment process forms part of the Statistical Parametric Mapping
(SPM) application developed by the Wellcome Department of Cognitive
Neurology in London [6]. SPM is used for processing and analyzing tomograph image
sequences, as obtained for example by functional Magnetic Resonance Imaging
(fMRI) or Positron Emission Tomography (PET). Such image sequences are used
in the field of neuroscience for the analysis of activity in different regions of
the human brain during cognitive and motor exercises.

Realignment is a computationally expensive step performed during the
preparation of raw image data for the subsequent statistical evaluation. It computes a
4x4 transformation matrix for each image of the sequence in order to compensate
for the effect of small movements of the patient, caused e.g. by breathing. The
images are realigned relative to the first image of the sequence.

The realignment algorithm for image sequences as obtained by fMRI will
briefly be presented. One has to distinguish two cases.

First Case: Realignment of one sequence of images: The reference data set
and the first matrix are obtained by performing a number of preparatory
computations using the image data of the first image. The matrices for all
remaining images are calculated using the reference data set.

Second Case: Realignment of multiple sequences of images: The reference data
set and the first matrix of the first sequence are calculated. Thereafter, the
first images of all remaining sequences are realigned relative to the first
image of the first sequence and its reference data set. Finally, the realignment
algorithm as described in the first case is applied to all sequences
independently.

At this point the only precondition for the calculation of the transformation 
matrix is the availability of the reference data set, which is calculated only once 
for each sequence. Once the reference data set(s) is(are) available, the matrices 
of the sequence(s) can be computed independently. 
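
The resulting structure (one preparatory step, then independent per-image
work) can be sketched in Python. The functions prepare_reference and realign
are placeholders invented for this illustration; they do not reproduce the SPM
computation, and the thread pool only stands in for the distributed workers.

```python
from concurrent.futures import ThreadPoolExecutor


def prepare_reference(first_image):
    # Placeholder for the preparatory computations on the first image.
    return sum(first_image) / len(first_image)


def realign(image, reference):
    # Placeholder for the real computation: returns a 4x4 matrix whose
    # translation entry is the mean offset from the reference.
    shift = sum(image) / len(image) - reference
    return [[1, 0, 0, shift],
            [0, 1, 0, 0],
            [0, 0, 1, 0],
            [0, 0, 0, 1]]


def realign_sequence(images, workers=4):
    # The reference data set is computed once per sequence ...
    reference = prepare_reference(images[0])
    # ... after which every matrix can be computed independently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda img: realign(img, reference), images))
```

No communication between the workers is needed once the reference data is
available, which is the property the parallelization relies on.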

The realignment application is already available as a sequential C++ program
and as manually parallelized C++/CORBA and C++/PVM programs. Previous work
shows that the overhead induced by CORBA is not prohibitive for its deployment
in clinical environments.

For the following steps it is necessary to transform the sequential C++ pro- 
gram into a Java program because some components of our tool environment 
only provide Java interfaces. This program transformation is performed using 
the Java Native Interface (JNI). An interesting intermediate result is that the 
deployment of JNI does not lead to any performance decrease for the specific 
program [12]. 

4 Integrating the Application into the Tool Environment 

In order to improve performance and scalability of the image-processing appli- 
cation we decided to integrate it into our load management system. 

As already mentioned in section 3 the availability of a Java program is a nec- 
essary prerequisite for the integration into the load management system, since it 
only provides services for Java/CORBA programs. The sequential Java realign- 
ment application is transformed into a distributed Java/CORBA application. 

Fig. 2. The structure of the medical image-processing application

Figure 2 depicts the structure of the CORBA application. The service offered
by the server object is the compute() service, which calculates the
transformation matrix for an image. The state of a server object consists of a
reference data queue (cache). Therefore it is replication safe, since it can be
replicated without applying a consistency protocol to its replicas, i.e. the
required cache data can easily be reestablished. A getReferenceData() service is
offered by each client and provides the specific reference data to the server if
it is not already cached.
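
The interplay of compute() and getReferenceData() can be illustrated with a
short Python model. It is a sketch under assumptions: the class names, the
sequence_id key, the cache capacity and the first-in-first-out eviction are
hypothetical, and the arithmetic in compute is a placeholder for the matrix
calculation.

```python
from collections import OrderedDict


class Client:
    def __init__(self, sequence_id, reference):
        self.sequence_id = sequence_id
        self._reference = reference
        self.callbacks = 0  # counts how often the server had to ask

    def get_reference_data(self):
        self.callbacks += 1
        return self._reference


class Server:
    def __init__(self, capacity=2):
        self.cache = OrderedDict()  # reference data queue
        self.capacity = capacity

    def compute(self, client, image):
        key = client.sequence_id
        if key not in self.cache:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # drop the oldest entry
            # Cache miss: fetch the reference data from the client.
            self.cache[key] = client.get_reference_data()
        reference = self.cache[key]
        return [pixel - reference for pixel in image]  # placeholder result
```

A fresh replica starts with an empty cache and simply refills it through the
callback, so no consistency protocol between replicas is required.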

The basic adaptation of the Java/CORBA application to the load balancer
is straightforward. Minor changes to the code are necessary in order to add the
ServantFactory and PersistentServantFactory methods to the server object.
In addition to those modifications the system is extended by various additional
components for testing particular aspects of the load management system. The
mechanism itself was integrated into the Java-based JacORB [1].

The second part of our tool environment consists of the Middleware
Monitoring Tool (MIMO) [3] and the graphical on-line visualization tool MiVis
(Middleware Visualization). The integration of these tools is straightforward, too.
MIMO provides some standard events like object creation, object deletion, ob- 
ject interactions, and additionally defines generic events. Furthermore, MIMO 
provides the infrastructure for designing active tools, i.e. tools that manipulate 
the monitored application. Initially we specify the data to be monitored, for 
example client and server hosts, client and server objects, server object load, 
server host load, application object interactions, and load balancing actions like 
migration and replication. This information is provided by a MIMO adapter that 
is used to instrument the application and the load management system. 

MiVis is a graphical on-line visualization tool that is based on the MIMO
monitoring system. It provides a framework that enables the development of
new display types which can be plugged into the tool core. We developed a new
display that is used for the visualization of the new monitoring events described
before. Figure 3 presents the basic layout of the graphical on-line tool. Client and
server objects are located within the respective rectangles representing the client
and server hosts. In addition, server object load (numerical representation) and
server host load values are depicted (numerical and graphical representation).
The CORBA method compute() is represented as a blue arrow (black in Fig. 3)
with a counter, and getReferenceData() as an offset turquoise arrow. Replications
and migrations are represented as yellow (white in Fig. 3) and red arrows,
respectively. Replication and migration actions can also be initiated manually by
a drag-and-drop function.

The combination of MIMO and MiVis provides a flexible and extensible
infrastructure for the development and the maintenance of large scale distributed
applications. Together with our monitoring system, the performance and
scalability of applications can be substantially improved.

Fig. 3. Visualization of a replication and of object interactions

5 Evaluation 

In order to evaluate the efficiency of the presented load management concept 
and its implementation, a test case is shown. 

The hardware consists of three machines with equal configuration. There is 
no background load on the machines. The examined CORBA application is the 
medical image-processing application described in section 3 with two simultane- 
ously requesting clients. The application is replication safe as already mentioned 
in section 4. Thus, migration and replication can be applied to this application. 

Figure 4 shows the processing time per image against the number of the
processed image for both clients. At the beginning, one server object is created
and placed on a machine (initial placement) and the clients start requesting the
server. The image processing time is the same for both clients at this stage
because the server alternately processes their requests. After a while the load
management system recognizes that the server is overloaded because both clients
permanently request the server. Accordingly, replication is performed, i.e. a
second server object (replica) is created and each client gets a replica of its own.
As a consequence of the replication, the image processing time of each client
decreases by about 50%. Some time later background processor load is generated
on the machine that is used by the second client's replica. Hence, the image
processing time of the second client substantially increases. Again, the load
management system recognizes the processor overload and migrates the affected
replica to the third machine, which was not used so far. The consequence is that
the image processing time returns to its normal level.




[Figure axis: Processing Time / Image [Sec.]]

Fig. 4. The load managed medical image-processing application



The test case shows how the load management system is able to deal with
different kinds of overload. Request overload is compensated by replication, whereas
background load is compensated by migrating an object to a less loaded host.
Consequently, the load management system improves the performance and the
scalability of the medical image-processing application.
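
The decision rule observed in the test case can be written down as a minimal
Python sketch. The function name, its parameters and both threshold values are
assumptions made for this illustration; the strategies actually used by the
load evaluation component are not specified here.

```python
def evaluate(background_load, pending_requests,
             load_limit=0.8, queue_limit=4):
    """Return the action for one server object: 'replicate', 'migrate' or 'none'.

    background_load  -- fraction of the host CPU used by other processes
    pending_requests -- number of client requests waiting for the object
    Both limits are hypothetical tuning parameters.
    """
    if pending_requests > queue_limit:
        return "replicate"  # request overload: add a replica
    if background_load > load_limit:
        return "migrate"    # background load: move to a less loaded host
    return "none"
```

For example, evaluate(0.1, 10) selects replication, while evaluate(0.95, 1)
selects migration.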

6 Conclusion and Future Work 

The combination of load balancing and graphical user interface provides a pow- 
erful environment for the production oriented image processing in medical en- 
vironments. Workstation clusters can be used as high performance servers for 
reconstruction and statistical analysis of tomography pictures. Our CORBA- 
based approach allows the integration of image processing into the workflow of 
clinical routine. Future steps in this field will cover aspects of fault tolerance, 
where the computing environment will have integrated mechanisms for fail-soft 
and recovery. 
Abstract. In the last few years, molecular biology has produced a large
amount of data, mainly in the form of sequences, that is, strings over an
alphabet of four (DNA/RNA) or twenty symbols (proteins). For
computational biologists the main challenge now is to provide efficient tools
for the analysis and the comparison of the sequences. In this paper, we
introduce and briefly discuss some open problems, and present a parallel
algorithm that finds repeated substrings in a DNA sequence or common
substrings in a set of sequences. The occurrences of the substrings can be
approximate, that is, can differ up to a maximum number of mismatches
that depends on the length of the substring itself. The output of the
algorithm is sorted according to different statistical measures of
significance. The algorithm has been successfully implemented on a cluster of
workstations.

1 Introduction 

On April 6th 2000, Celera Genomics announced to the world that the
sequencing phase of the genome of a human being was completed. This news made
headlines all over the world, and even if it was to be taken with a grain of
salt¹, Celera's announcement had and is still having a great impact on science,
religion, and politics. Although important, the news was, for many people
already involved in molecular biology and computational biology, just the tip of an
iceberg. In the last few years, molecular biologists have produced a large amount
of data, and more are going to come in the near future. For instance, since last
April, the genomes of Drosophila Melanogaster (fruit fly) [1] and Arabidopsis
Thaliana (thale cress) [2] have already been completed. Coming up next: the
mouse genome.

Biological data come in the form of DNA or protein sequences. The standard
assumption is that the sequences contain all the information needed to obtain
biologically meaningful results, abstracting away the reality of DNA and
proteins as flexible three-dimensional molecules interacting with one another in a
dynamic environment. For computer scientists, this is a real godsend (at least
at the beginning, until they find out what problems they have to face), that
allows them to enter such a fascinating world and at the same time to work on a
data structure they are very familiar with: the string. When dealing with DNA
or RNA, strings are built over an alphabet of four symbols, corresponding to
the four DNA nucleotides. For proteins, we have an alphabet of twenty symbols,
corresponding to the twenty different amino acids that build them. Leaving aside
problems related to the generation of the data itself (such as sequence assembly
problems), there are scores of different challenging problems deriving from the
analysis of biological sequences, whose solution can provide efficient and powerful
tools for the biological community (for a complete survey, see [3]).

¹ If we figure the genome as a book, the sequencing stage has produced millions of
pages, alas, with no page numbers on them. The assembly phase, that is, putting
the pages in the right order, has yet to be completed.

Table 1. Average genome sizes.

Organism               Genome size
Epstein-Barr virus     0.172 × 10^6 bytes
Bacterium (E. coli)    4.8 × 10^6 bytes
Beer Yeast             14.4 × 10^6 bytes
Nematode Worm          100 × 10^6 bytes
Thale Cress            100 × 10^6 bytes
Fruit Fly              165 × 10^6 bytes
Homo Sapiens           3300 × 10^6 bytes

Most of the problems admit polynomial time solutions. However, when deal- 
ing with whole chromosomes or even genomes, the size of the data is such that 
even linear algorithms become time consuming. Some figures are shown in Ta- 
ble 1. On the other hand, in some cases a parallel version of the algorithm is 
trivial to implement; therefore, running an algorithm even on a small cluster 
of workstations can yield significant improvements on the time required by the 
sequential version. 

2 Open Problems 

One of the most widely studied problems is finding various types of repetitive 
structures in biological strings. One of the most striking features of DNA (or 
to a lesser degree, proteins) is the number of repeated substrings that occur in 
genomes. It has been estimated that families of reiterated sequences account for 
about one third of the human genome. A short discussion on types and roles of 
repeated structures in DNA can be found in [3]. The main difficulty lies in the 
fact that repeats of the same pattern can be approximate, that is, may present 
mutations, insertions, or deletions of symbols. If we restrict our attention to 
mutations, the problem can be formalized as follows. 

Problem 1 Given an alphabet $\Sigma$, a string S on $\Sigma$, and two integers e and q,
find all the patterns that occur at least q times in S with at most e mismatches.

A closely related problem, at least when it comes to its solution, is to find
common substrings in a set of strings. For example, if some sequences share
the same biological function, the common substrings could hint at which parts
are responsible for the function itself. Again, if we allow only mismatches, the
problem can be formalized as follows.

Problem 2 Given an alphabet $\Sigma$, a set of strings $S_1, \dots, S_k$ on $\Sigma$, and two
integers e and q, find all the patterns that occur in at least q strings with at most
e mismatches.

Alas, when dealing with biological sequences, things are not that simple.
For example, in proteins, some amino acids have similar chemical and physical
properties, while others are significantly different. Therefore, mismatches in the
strings should also be weighted according to the similarities between pairs of
amino acids. Moreover, as we already mentioned, each occurrence of a pattern
could present the insertion and/or deletion of some symbols. Thus, Problem 1
becomes:

Problem 3 Given an alphabet $\Sigma$, a string S on $\Sigma$, an error threshold $e \in \mathbb{R}$,
an integer q, and a distance measure $\mathcal{D}$, find all the patterns that occur at least q
times in S such that, for every occurrence, the distance between the pattern and
the occurrence measured according to $\mathcal{D}$ is less than or equal to e.

For example, in proteins the distance between amino acids can be measured
according to PAM or BLOSUM matrices, which define a distance value for each
pair of amino acids. Thus, given two strings $S^1 = s^1_1 \dots s^1_l$ and
$S^2 = s^2_1 \dots s^2_l$ of equal length, the distance $\mathcal{D}$ between the two
strings is the sum of the distances between the corresponding symbols $s^1_i$ and
$s^2_i$. When insertions and deletions are taken into account, the measure of
distance usually used is the edit distance. Given two strings (of arbitrary
length), the edit distance is defined as the minimum number of edit operations
(mutation, insertion or deletion of a symbol) needed to transform one string into
the other. If we apply Hamming distance to Problem 3 we have again Problem 1.
Problem 2 can be extended in a similar way.
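
The two unweighted distance measures mentioned above can be written down in a
few lines of Python. This is a textbook sketch for illustration, not code from
the algorithm discussed in this paper; the function names are our own, and the
edit distance uses the standard dynamic programming recurrence with unit costs.

```python
def hamming(a, b):
    """Number of mismatching positions between two strings of equal length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Minimum number of mutations, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # mutate x into y (or match)
        prev = cur
    return prev[-1]
```

For two strings of equal length the edit distance is never larger than the
Hamming distance, since mutations alone already transform one into the other.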

Clearly, the complexity of the problems we introduced depends on the
distance measure adopted. If we use Hamming distance, a naive algorithm that
generates all patterns of length k on $\Sigma$ and checks whether they satisfy the
constraints takes $O(|\Sigma|^k e n)$ time, where $O(en)$ time is usually required to find
the occurrences of each pattern. Some improvements have been introduced for
the latter, but the main drawback is the $|\Sigma|^k$ factor, due to the exhaustive
enumeration of the patterns. The $O(|\Sigma|^k e n)$ time bound has been improved in [4],
where it is reduced to $O(|\Sigma|^e k^e n)$ by means of suffix trees, a data structure
that exposes the internal structure of a string in a very deep and meaningful
way. Further improvements can be obtained by introducing some heuristics that
somehow prune the search space, or that impose some restrictions on the
location of the mismatches, for example, forcing them to occur at the same positions
in each occurrence. The main drawback of heuristic methods is the fact that
some "interesting" patterns can be missed altogether. We believe that heuristic
approaches are suitable to perform a quick analysis of the data, while exhaustive
enumeration, implemented as efficiently as possible, is perhaps the best choice
if a thorough once-for-all analysis is needed.
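
The naive exhaustive enumeration can be made concrete with a short Python
sketch for Problem 1 with a fixed pattern length k. It is an illustration under
assumptions: the function name and the dictionary result are our own choices,
and each candidate position is compared symbol by symbol, so the cost per
pattern here is O(kn) rather than the O(en) mentioned above.

```python
from itertools import product


def repeated_patterns(s, alphabet, k, e, q):
    """Patterns of length k occurring at least q times in s with <= e mismatches.

    Returns a dict mapping each such pattern to its 0-based start positions.
    """
    result = {}
    # Exhaustive enumeration: |alphabet|**k candidate patterns.
    for letters in product(alphabet, repeat=k):
        pattern = "".join(letters)
        positions = [
            i for i in range(len(s) - k + 1)
            if sum(a != b for a, b in zip(pattern, s[i:i + k])) <= e
        ]
        if len(positions) >= q:
            result[pattern] = positions
    return result
```

Note that with e > 0 a pattern may be reported even though it never occurs
exactly in the string, which is why the enumeration ranges over all strings of
length k and not only over the substrings of s.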

The algorithm we present in this paper solves Problems 1 and 2, and can be
seen as an extension of the one presented in [4]. In the next section we introduce
the data structure the algorithm is based on, while in sections 4, 5 and 6 we
present the algorithm and two different parallel versions.



3 Suffix Trees 

A suffix tree T for an n-character string $S = s_1 \dots s_n$ is a rooted directed tree
with exactly n leaves numbered 1 to n. Each internal node, other than the root,
has at least two children. Each edge is labeled with a nonempty substring of S.
Two edges leaving the same node cannot have labels beginning with the same
character. For any leaf i, the concatenation of the edge labels on the path from
the root to leaf i exactly spells the suffix of S starting at position i, that is, it
spells out $s_i \dots s_n$.

The definition of suffix tree just given, however, does not guarantee that a
suffix tree exists for every string S. The problem is that if one suffix of S matches
a prefix of another suffix, then the tree cannot be built, since the path for the
first suffix would not end up in a leaf. This problem can be avoided by assuming
that the last symbol of the string does not appear elsewhere in the string, i.e.,
by appending to the string a termination symbol that does not belong to the
string alphabet, as shown in Fig. 1.



Fig. 1. Suffix tree for string xabxa. Symbol $ is used as termination. Without the
termination, suffix xa would not end up in a leaf, since it is also a prefix of xabxa.



A suffix tree for a string can be built with different methods [5,6] in time
linear in the length of the string. As a matter of fact, for many string problems
a linear time solution can be obtained only by using suffix trees or
analogous text-indexing structures. It is also straightforward to prove that the
space required by the tree (depending on the number of its nodes) is O(n).

Given a string S of length n on a finite alphabet $\Sigma$, once the suffix tree for S
has been built, searching for a pattern p in the string is straightforward. Starting
from the root, we match the symbols of p along the unique path in the tree
until either p is exhausted, or no more matches are possible. In the former case,
each leaf in the subtree below the last match is numbered with a starting
location of p in the string. If we are interested only in counting the number of
occurrences of patterns, we can also annotate each node of the tree with the
number of leaves in the subtree below it. This can be done, once the tree has
been built, with a linear time (since we have O(n) nodes) depth-first traversal
of the tree. If the pattern is m symbols long, the search takes O(m) time, with
an overall time complexity of O(n + m), equaling “classic” pattern matching
algorithms like Boyer-Moore [7] or Knuth-Morris-Pratt [8]. Suffix trees, however,
are appealing when we want to search for many different patterns in the same
string. Traditional pattern matching algorithms pre-process the pattern instead
of the string and require, for each pattern, O(m) time for the preprocessing and
O(n) time for the actual search. This approach might become time consuming
in practice when n >> m, as in the case of biological sequences. With suffix
trees, instead, the O(n) time is required only once, for the construction of the
structure. Once the tree has been built, searching for each pattern takes only
O(m) time. Moreover, the linear time construction of the tree has proven
to be very efficient in practice.
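The counting search described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it builds an uncompressed suffix trie by inserting every suffix, which takes quadratic time and space, instead of using a linear time suffix tree construction. The names `Node`, `build_suffix_trie` and `count_occurrences` are hypothetical and do not come from the paper.

```python
from typing import Dict


class Node:
    """One node of an uncompressed suffix trie (illustrative only)."""

    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        self.count = 0  # number of suffixes passing through this node


def build_suffix_trie(text: str, terminator: str = "$") -> Node:
    """Insert every suffix of text + terminator; O(n^2), unlike a real suffix tree."""
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    s = text + terminator
    root = Node()
    for start in range(len(s)):
        node = root
        node.count += 1
        for ch in s[start:]:
            node = node.children.setdefault(ch, Node())
            node.count += 1
    return root


def count_occurrences(root: Node, pattern: str) -> int:
    """Follow the pattern from the root in O(m) steps and read the stored count."""
    node = root
    for ch in pattern:
        nxt = node.children.get(ch)
        if nxt is None:
            return 0
        node = nxt
    return node.count


trie = build_suffix_trie("xabxa")
print(count_occurrences(trie, "xa"))   # 2
print(count_occurrences(trie, "abx"))  # 1
print(count_occurrences(trie, "xx"))   # 0
```

The count stored in a node plays the role of the number of leaves below the corresponding locus in the suffix tree, so each query costs time proportional to the pattern length only.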

We can also search for a pattern p with at most e mismatches in a similar 
way. In this case, we match p along different paths on the tree at the same time, 
keeping track of the number of mismatches encountered on each path. Whenever 
the number of errors on a path is greater than e, we discard that path. If we 
complete p, the surviving paths represent all the occurrences of p in S with at
most e mismatches. 
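A minimal sketch of this approximate search, again on an uncompressed suffix trie rather than a true suffix tree, might look as follows. The nested dictionary layout and the function names are assumptions made for the example.

```python
def build_suffix_trie(text, terminator="$"):
    """Uncompressed suffix trie; node["#"] is the number of suffixes through the node."""
    s = text + terminator
    root = {"#": 0, "kids": {}}
    for start in range(len(s)):
        node = root
        node["#"] += 1
        for ch in s[start:]:
            node = node["kids"].setdefault(ch, {"#": 0, "kids": {}})
            node["#"] += 1
    return root


def count_approx(root, pattern, max_mismatches, terminator="$"):
    """Count the positions where pattern occurs with at most max_mismatches mismatches."""
    paths = [(root, 0)]  # (node reached so far, mismatches so far)
    for symbol in pattern:
        extended = []
        for node, errors in paths:
            for ch, child in node["kids"].items():
                if ch == terminator:
                    continue  # the terminator never belongs to an occurrence
                new_errors = errors + (ch != symbol)
                if new_errors <= max_mismatches:
                    extended.append((child, new_errors))
        paths = extended  # paths with too many errors have been discarded
    return sum(node["#"] for node, _ in paths)


trie = build_suffix_trie("acgtacctacga")
print(count_approx(trie, "acgt", 1))  # 3
print(count_approx(trie, "acgt", 0))  # 1
```

All surviving paths are advanced in parallel, one pattern symbol at a time, and a path is dropped as soon as its error count exceeds the bound.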

4 The Algorithm 

The starting point of the algorithm is the search method for approximate
occurrences of a pattern outlined at the end of the previous section. Given a string S,
its suffix tree T and a pattern p, we call the locus of p in T the point of T
(possibly in the middle of an edge) at which the path corresponding to p ends.
We want to solve a slightly different version of Problem 1: given an error ratio
$\epsilon$, with $0 < \epsilon < 1$, we want to find all the patterns p that occur at
least q times in S with at most $\lceil \epsilon |p| \rceil$ errors,
where |p| denotes the length of p. We also assume that a maximum length M of
patterns to be sought has been given as input.

First of all, we build the suffix tree T for S, and annotate each node with the
number of occurrences of the corresponding substring. The core of the algorithm
is the recursive procedure expand, outlined in Fig. 2. Suppose we have found on
T the paths corresponding to approximate occurrences of a pattern $p = p_1 \ldots p_m$,
that is, a set of pointers to the loci on T of patterns whose Hamming distance
from p is less than or equal to $\lceil \epsilon |p| \rceil$. We will denote this set of
pointers by Actual_p. Also, we have associated with each pointer the Hamming distance from
p of the corresponding substring. The overall number of approximate occurrences
of p is given by the sum of the occurrences of the substrings spelled by the paths
in Actual_p, which can be read in the nodes entered by the last edges on the paths.
Furthermore, we keep another list of pointers (called Future_p), corresponding
to the paths spelling patterns whose Hamming distance from p is greater than
$\lceil \epsilon |p| \rceil$ but less than or equal to $\lceil \epsilon M \rceil$. This is
the list of future occurrences of p. That is, p might be the prefix of a longer
pattern, which can admit up to $\lceil \epsilon M \rceil$ mismatches. These paths,
however, do not contribute to the number of occurrences of p.

    expand(p, s, Actual_p, Future_p)
        p' = ps
        Occ(p') = 0
        Fut(p') = 0
        Actual_p' = ∅
        Future_p' = ∅
        for all q ∈ Actual_p ∪ Future_p do
            for all q' ∈ Next(q) do
                if the last symbol pointed by q' matches s
                    error(q') = error(q)
                else
                    error(q') = error(q) + 1
                end if
                if error(q') ≤ ⌈ε|p'|⌉
                    add q' to Actual_p'
                    Occ(p') = Occ(p') + count(q')
                    Fut(p') = Fut(p') + count(q')
                else if error(q') ≤ ⌈εM⌉
                    add q' to Future_p'
                    Fut(p') = Fut(p') + count(q')
                end if
            end for
        end for

        if Occ(p') ≥ minocc
            report(p')
        end if

        if Fut(p') ≥ minocc
            for all σ ∈ Σ
                expand(p', σ, Actual_p', Future_p')
            end for
        end if

        return

Fig. 2. The pseudo-code of the procedure expand. Next(q) returns a set of pointers
to the endpoints of paths obtained by extending by one symbol the path pointed by
q; count(q) returns the number of occurrences of the substring whose path is pointed
by q; report(p') saves the pattern p', which satisfies the input constraints and has to be
output to the user; minocc is the minimum number of occurrences required.

Now we try to expand p by one symbol. That is, for each symbol $\sigma \in \Sigma$, we
match $\sigma$ against the next symbol on the paths pointed to by Actual_p and Future_p. If
a path ends at a node of the tree, we match $\sigma$ against the first symbol
on each edge leaving that node. Whenever we encounter a mismatch, we increase the
previous error along the path by one. Otherwise, the error remains unchanged.
If the new error is less than or equal to $\lceil \epsilon(|p| + 1) \rceil$, we add the
corresponding pointer to Actual_p', the set of actual occurrences of
$p' = p_1 \ldots p_m \sigma$; if it is greater than $\lceil \epsilon(|p| + 1) \rceil$ but
less than or equal to $\lceil \epsilon M \rceil$, we add it to Future_p'; otherwise
we discard the path. If the new symbol added to the path is the termination symbol, we
discard the path as well. At the end, once all the actual and future occurrences
of p have been checked, we have built the sets of actual and future occurrences
of p', and computed its actual number of occurrences (Occ(p')) and the number
of potential future occurrences (Fut(p')). If Fut(p') is at least q, and the length
of p' is less than M, we expand p' in the same way; otherwise we continue with
p, moving on to the next symbol in $\Sigma$. Notice that neither p nor p' is required to
occur exactly in S. The algorithm starts by expanding the empty pattern from
the root of the suffix tree.
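The expansion step can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: it works on an uncompressed suffix trie built in quadratic time, pointers are plain (node, error) pairs, and the names `find_patterns`, `eps`, `max_len` and `minocc` are chosen for the example. The error ratio is a float, so values of `eps` that are not exactly representable may round the ceilings differently from exact arithmetic.

```python
import math


def build_suffix_trie(text, terminator="$"):
    """Uncompressed suffix trie; node["#"] is the number of suffixes through the node."""
    s = text + terminator
    root = {"#": 0, "kids": {}}
    for start in range(len(s)):
        node = root
        node["#"] += 1
        for ch in s[start:]:
            node = node["kids"].setdefault(ch, {"#": 0, "kids": {}})
            node["#"] += 1
    return root


def find_patterns(text, alphabet, eps, max_len, minocc, terminator="$"):
    """Return {pattern: number of approximate occurrences} for the reported patterns."""
    root = build_suffix_trie(text, terminator)
    max_err = math.ceil(eps * max_len)  # ceil(eps * M): bound for future occurrences
    found = {}

    def expand(p, s, actual, future):
        p_new = p + s
        threshold = math.ceil(eps * len(p_new))
        new_actual, new_future = [], []
        occ = fut = 0
        for node, err in actual + future:
            for ch, child in node["kids"].items():
                if ch == terminator:
                    continue  # paths that reach the terminator are discarded
                new_err = err + (ch != s)
                if new_err <= threshold:
                    new_actual.append((child, new_err))
                    occ += child["#"]
                    fut += child["#"]
                elif new_err <= max_err:
                    new_future.append((child, new_err))
                    fut += child["#"]
        if occ >= minocc:
            found[p_new] = occ
        if fut >= minocc and len(p_new) < max_len:
            for sym in alphabet:
                expand(p_new, sym, new_actual, new_future)

    for sym in alphabet:
        expand("", sym, [(root, 0)], [])
    return found


patterns = find_patterns("acgtacctacga", "acgt", eps=0.25, max_len=4, minocc=2)
print(patterns["acgt"])  # 3
```

A pointer that sits in the future list at one length can move to the actual list at a later length, because the allowed number of errors grows with the pattern; this is why both lists are rescanned at every expansion.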



4.1 Sorting the Output 

So far, we have been concerned with the size of the input, and how to develop an
efficient algorithm. But, if we examine its output, that is, the list of patterns that
satisfy the constraints given as input by the user, we have to face another issue.
The output is usually huge, especially when the input sequence is long, and the
minimum number of occurrences required is low. For example, Saccharomyces
cerevisiae chromosome I contains more than 200,000 base pairs. If we ask the
algorithm to report all the patterns, of maximum length 10, that occur at least
twice in it with no errors, we obtain 153,397 patterns. Trying to make some sense
of the output could turn into a nightmare for the hapless biologist. One possible
solution is to associate with each pattern a measure of significance, trying to
reflect as much as possible its biological importance, and to sort the output
accordingly. For example, we may consider biological sequences as random strings
emitted by a source according to an unknown probability distribution over the
symbols of the alphabet. For the case in which errors are not allowed, a complete
survey on this topic can be found in [9], and an algorithm, based on suffix trees, that finds
“significant” patterns according to different measures is presented in [10]. In the
latter, measures of significance compare in different ways the number of
occurrences of a pattern with an expected value. Perhaps the simplest one is the
following:



$$z_1 = Occ(p) - E[Occ(p)]$$







where Occ(p) is the number of occurrences of a pattern p in a string, and
E[Occ(p)] is the corresponding expected value computed according to a given
probability distribution. The higher the value of $z_1$, the more “surprising”
the number of occurrences of p. On the other hand, if $z_1$ is negative,
p appears less often than expected. More sophisticated measures can be defined, like
the following:

$$z_2 = \frac{Occ(p)}{E[Occ(p)]}$$

$$z_3 = \frac{(Occ(p) - E[Occ(p)])^2}{E[Occ(p)]}$$

$$z_4 = \frac{Occ(p) - E[Occ(p)]}{\sqrt{Var(Occ(p))}}$$



When approximate occurrences are taken into account, we think that a mea- 
sure of significance should consider not only the number of occurrences of a 
pattern, but also how well conserved the pattern is. A pattern of length k that 
occurs q times with no errors should be considered more significant than another 
pattern, of the same length, that occurs q times as well but with e errors in each 
occurrence. Another factor that should be considered is where mutations have 
occurred. A pattern where mismatches occur, for example, in the central e posi- 
tions should be more significant than another one where mutations are randomly 
distributed, since it can be seen as composed of two perfectly conserved parts.
We now show how these considerations have been implemented in the algorithm. 

Given a pattern $p = p_1 \ldots p_m$, we denote by $\mathcal{H}(p,e)$ the set of patterns
within Hamming distance e from p. Let S be a string over an alphabet $\Sigma$.
We assume that S has been generated by a random memoryless source with a
given probability distribution on $\Sigma$. For each symbol $\sigma_i \in \Sigma$, we estimate the
probability that $\sigma_i$ is generated by the source with the maximum likelihood
estimator:

$$\pi_S(\sigma_i) = \Pr[\sigma_i \text{ appears in } S] = \frac{count(\sigma_i)}{|S|} \qquad (1)$$



where $count(\sigma_i)$ denotes the number of occurrences of $\sigma_i$ in S. The probability
$\Pi_S(p)$ that p occurs in S with no errors is therefore given by:

$$\Pi_S(p) = \Pr[p \text{ occurs in } S] = \prod_{i=1}^{m} \pi_S(p_i) \qquad (2)$$

Allowing overlaps, the number of occurrences of p in S (denoted by $Occ_S(p)$)
is a random variable with binomial distribution, whose expected value is given
by:

$$E[Occ_S(p)] = \Pi_S(p) \cdot (|S| - |p| + 1)$$

When approximate occurrences are allowed, that is, we allow at most e errors
for p, the a priori probability of finding a valid occurrence of p is:

$$\Pi_S(p,e) = \sum_{p' \in \mathcal{H}(p,e)} \Pi_S(p')$$

The corresponding expected value for $Occ_S(p,e)$ (the number of occurrences of
p in S with at most e mismatches) is given by:

$$E[Occ_S(p,e)] = \Pi_S(p,e) \cdot (|S| - |p| + 1)$$

If we use the value $\Pi_S(p,e)$, however, we lose information on how p actually
appears in S. Therefore, we compute the expected value of $Occ_S(p,e)$ according
to the a posteriori probability:

$$\Pi_S(p,e) = \sum_{p' \in O_S(p,e)} \Pi_S(p')$$

where $O_S(p,e)$ is the set of patterns in $\mathcal{H}(p,e)$ that appear at least once in
S. That is, we sum only the probabilities of patterns corresponding to actual
occurrences of p in S. In this way, the more conserved a pattern is, the lower
its probability of occurring in S, and the higher its significance value according to
the measures defined above. Moreover, computing the a posteriori probabilities
is straightforward. We just have to add to each pointer we use in the expand
procedure the probability value of the corresponding path. Thus, the a posteriori
probability of a pattern p is given by the sum of the probability values associated
with the pointers in Actual_p. When a pointer q is expanded to q', the probability
of the path pointed to by q' can be computed by multiplying the probability of the
path pointed to by q by the probability of the symbol added by q'. In this way,
whenever a pattern satisfies the input constraints we can compute its significance
value according to one of the measures defined before. The variance of Occ(p) used
in $z_4$ can be approximated by neglecting terms due to overlaps, that is, with the
variance of the binomial distribution, $(|S| - |p| + 1)\,\Pi_S(p,e)\,(1 - \Pi_S(p,e))$.
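As an illustration of the quantities defined in this section, the following Python sketch computes the maximum likelihood symbol probabilities, the a posteriori probability of a pattern, and the four significance measures. It scans the text directly instead of reading values off the suffix tree, and it assumes that the last measure normalizes by the standard deviation (the square root of the binomial variance). The function names are hypothetical.

```python
import math
from collections import Counter


def symbol_probs(text):
    """Maximum likelihood estimate of each symbol probability (Equation 1)."""
    counts = Counter(text)
    return {sym: c / len(text) for sym, c in counts.items()}


def pattern_prob(pattern, probs):
    """Probability of an exact occurrence under a memoryless source (Equation 2)."""
    return math.prod(probs.get(sym, 0.0) for sym in pattern)


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def significance(text, pattern, e):
    """Return (occurrences, expected value, (z1, z2, z3, z4)), a posteriori version."""
    probs = symbol_probs(text)
    m = len(pattern)
    trials = len(text) - m + 1
    windows = [text[i:i + m] for i in range(trials)]
    hits = [w for w in windows if hamming(w, pattern) <= e]
    occ = len(hits)
    if occ == 0:
        raise ValueError("pattern has no occurrence within the given distance")
    # Sum the probabilities of the distinct variants that really appear in the text.
    post = sum(pattern_prob(v, probs) for v in set(hits))
    expected = post * trials
    variance = trials * post * (1.0 - post)  # binomial approximation, overlaps ignored
    z1 = occ - expected
    z2 = occ / expected
    z3 = (occ - expected) ** 2 / expected
    z4 = (occ - expected) / math.sqrt(variance)  # assumption: standard deviation
    return occ, expected, (z1, z2, z3, z4)


occ, expected, scores = significance("acgtacctacga", "acgt", 1)
print(occ, round(expected, 4), [round(z, 3) for z in scores])
```

In the algorithm itself the same sum would be accumulated incrementally, by attaching to each pointer the probability of its path and multiplying by the probability of the added symbol at every expansion.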

4.2 Time Complexity 

The first step, the construction of the tree, takes O(n) time. Let M be the
maximum pattern length allowed and $e = \lceil \epsilon M \rceil$. For each call to the expand
procedure, there are at most O(n) different paths to be checked, each one in
constant time. Moreover, we stop expanding a pattern whenever the number of
future occurrences is less than q. Thus, the complexity of the algorithm depends
on the number of patterns that have to be expanded, which can be estimated,
as in [4], by $\sum_{i=1}^{e} \binom{M}{i} (|\Sigma| - 1)^i$. The overall time complexity is
therefore $O(|\Sigma|^e M^e n)$. The sorting stage takes $O(v \log v)$ time, where v is the
number of valid patterns reported by the algorithm.

4.3 Speedups 

In practice, there are different ways to prune the search space, in order to get
faster, even if less precise, results. For example, we may ask the algorithm to
consider only patterns that occur at least once exactly in the sequence. It is
sufficient to check, in the expand procedure, whether there is one path with error
zero, corresponding to the exact occurrences of the pattern. This significantly
reduces the search space. The risk is to miss completely a significant pattern
that has mutated in every occurrence. We can also run the algorithm in prefix mode:
given a pattern $p = p_1 \ldots p_m$ and an error ratio $\epsilon$, for every valid occurrence
$p' = p'_1 \ldots p'_m$ of the pattern we must have:

$$\forall i \in \{1, \ldots, m\}: \quad d_H(p_1 \ldots p_i,\; p'_1 \ldots p'_i) \le \lceil \epsilon i \rceil$$

where $d_H$ denotes the Hamming distance. That is, an occurrence of the pattern is
valid iff it is a valid occurrence for each of its prefixes. In this way, we require
mismatches to be uniformly distributed along the pattern. To implement this, we just
have to discard the Future_p list of pointers, maintaining only Actual_p. The risk is
to lose some potentially valid occurrences. In any case, the parameters of the
algorithm can be fine tuned in order to reduce the probability of missing significant
patterns to negligible values [11]. Moreover, in this case we need not provide the
algorithm with an explicit value for the maximum length of the patterns.

5 Extensions 

We now briefly sketch how the algorithm has been extended in order to solve
Problems 2 and 3. For the common substrings problem on a set of k strings,
we build a generalized suffix tree, as in [3], where each node is annotated with
a k-bit string. The substring spelled out by the path from the root to a given
node occurs in the i-th string of the set iff the i-th bit of the node is set. This
can be done with an O(kn) time pre-processing of the tree. During the pattern
searching phase, instead of summing the counters of the nodes, we OR the bit
strings, obtaining two bit strings for the actual and future occurrences of the
patterns. Then, instead of checking the actual and future counters, we check how
many bits are set in the two bit strings. If there are more than q bits set (actual
or future), we expand the pattern. Significance measures based on the number
of sequences a pattern appears in and the corresponding expected value can also
be defined, as in the single sequence case.
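The bit string annotation can be illustrated with the following Python sketch. It is a simplification under stated assumptions: the structure is an uncompressed trie of all substrings up to a maximum length rather than a generalized suffix tree, Python integers stand in for the k-bit strings, only actual occurrences with a fixed error bound are handled, and the function names are invented for the example.

```python
def build_generalized_trie(strings, max_len):
    """Trie of all substrings up to max_len; node["mask"] has bit i set
    iff the substring spelled by the node occurs in strings[i]."""
    root = {"mask": 0, "kids": {}}
    for idx, s in enumerate(strings):
        bit = 1 << idx
        for start in range(len(s)):
            node = root
            node["mask"] |= bit
            for ch in s[start:start + max_len]:
                node = node["kids"].setdefault(ch, {"mask": 0, "kids": {}})
                node["mask"] |= bit
    return root


def strings_containing(root, pattern, max_mismatches):
    """OR the masks of all paths within max_mismatches of pattern.
    Returns (number of strings with an occurrence, combined bit mask)."""
    paths = [(root, 0)]
    for symbol in pattern:
        extended = []
        for node, err in paths:
            for ch, child in node["kids"].items():
                new_err = err + (ch != symbol)
                if new_err <= max_mismatches:
                    extended.append((child, new_err))
        paths = extended
    mask = 0
    for node, _ in paths:
        mask |= node["mask"]
    return bin(mask).count("1"), mask


trie = build_generalized_trie(["acgtac", "ttacctg", "gggg"], max_len=4)
print(strings_containing(trie, "acgt", 0))  # (1, 1)
print(strings_containing(trie, "acgt", 1))  # (2, 3)
```

Counting the set bits of the combined mask replaces the sum of occurrence counters used in the single sequence case.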

The algorithm can also be run with different error measures. The basic version 
simply gives an error value of +1 to mismatches, and 0 to matches. These values 
can be redefined by providing the algorithm with appropriate scores for every 
pair of symbols, as introduced in the discussion of Problem 3. Then, instead of 
defining a maximum number of mismatches, we have to define an error threshold 
T, possibly as a function of the pattern length.

6 Parallel Implementation 

For the parallel version of the algorithm, we considered two possible alternatives: 
dividing the set of patterns among the processors, assigning a different subset to 
each one, or breaking the sequence(s), making each processor work on a different 
region. Both versions have been implemented on a cluster of five workstations, 
using the Message Passing Interface (MPI) library.

6.1 Dividing the Patterns

As we have seen, the algorithm is composed of three parts: construction of the
suffix tree, pattern discovery, and sorting of the output. Leaving aside for the
moment the theoretical time complexity of the three stages, we have noticed that,
in practice, the most time consuming part is the second one, that is, finding the
patterns. As a matter of fact, building the tree (pruned to the maximum length
of the patterns sought), even with a non-optimal implementation of the
construction algorithm, requires at most a few seconds even for sequences of millions of
base pairs. Moreover, sorting thousands of patterns requires a few seconds.
Finally, the annotated suffix tree, pruned at the maximum length needed, usually
fits with no problems into the main memory of a medium size workstation,
leaving enough room for the data structures required by the matching and sorting
stages. All these considerations have led us to implement one parallel version of
the algorithm as follows.

First of all, each processor builds its own copy of the suffix tree, pruned at a
maximum pattern length given by the user. Then, each processor starts to search
for its own set of patterns. That is, patterns are distributed among the processors
in order to obtain a workload as balanced as possible. For example, suppose we
have as input a DNA sequence where the four symbols have (approximately) the
same frequency, and we want to run the algorithm on four processors. The first
processor scans its own copy of the tree for all the patterns starting with A, the
second one for those starting with C, and so on. If we have eight processors at
our disposal, the first one will search for patterns starting with AA and AC, the
second one for those starting with AG and AT, and so on.

This simple heuristic can also be extended to the case of non-uniform
probability distributions over the symbols of the alphabet. Again, we suppose we
are working on the DNA alphabet $\Sigma = \{A, C, G, T\}$, and we want to run the
algorithm on P processors. First, we generate all the strings on $\Sigma$ of length k in
lexicographic order. Let $p^1, p^2, \ldots, p^{4^k}$ be the strings. For each of them, we also
compute the probability $\Pi_S(p^i)$ to occur in S according to Equations 1 and 2.
Then, processor number one searches for patterns starting with $p^1$, $p^2$, and so
on, until the sum of the probabilities of the prefixes used equals or is greater
than 1/P. At the same time, processor number two scans the list of k-letter
patterns until it finds the first pattern $p^i$ such that $\sum_{h=1}^{i} \Pi_S(p^h) > 1/P$. Then,
it starts searching for patterns starting with $p^i$, $p^{i+1}$, and so on, until it meets a
pattern $p^j$ such that $\sum_{h=i}^{j} \Pi_S(p^h) \ge 1/P$. In the same way, processor number l
will search for the first $p^i$ such that $\sum_{h=1}^{i} \Pi_S(p^h) > (l - 1)/P$, and proceed
in the same way.
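A possible way to compute such an assignment is sketched below in Python. It follows the idea of cutting the lexicographically ordered list of k-letter prefixes whenever the cumulative probability crosses a multiple of 1/P, but the exact handling of ties and of very likely prefixes is an assumption of this example, as are the function name and the small tolerance used for floating point comparisons.

```python
from itertools import product


def assign_prefixes(probs, k, n_procs):
    """Split the k-letter prefixes, in lexicographic order, into n_procs
    contiguous blocks of roughly equal total probability."""
    symbols = sorted(probs)
    blocks = [[] for _ in range(n_procs)]
    cumulative = 0.0
    proc = 0
    for letters in product(symbols, repeat=k):
        weight = 1.0
        for sym in letters:
            weight *= probs[sym]  # memoryless source: product of symbol probabilities
        blocks[proc].append("".join(letters))
        cumulative += weight
        # Move on once this processor's share of the total probability is reached.
        while proc < n_procs - 1 and cumulative >= (proc + 1) / n_procs - 1e-12:
            proc += 1
    return blocks


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(assign_prefixes(uniform, 1, 4))      # [['A'], ['C'], ['G'], ['T']]
print(assign_prefixes(uniform, 2, 8)[:2])  # [['AA', 'AC'], ['AG', 'AT']]
```

With a skewed distribution a single prefix can carry more than one share of the probability; in this sketch the following processors then receive an empty block, which suggests increasing k.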

After completing the searching step, each processor sorts the patterns found 
according to the measure adopted. If the workload has been balanced correctly, 
we expect all processors to complete this step more or less at the same time. Note 
that no communication among processors has been needed so far. At this point, 
all processors communicate to the same processor their sorted list of patterns, 
together with the significance values. The lists are finally merged together into 
a single sorted list, which is output to the user.

Fig. 3. Sequence splitting between two processors. Two sequences $S_1 = s^1_1 \ldots s^1_N$ and
$S_2 = s^2_1 \ldots s^2_N$ of equal length N have been given as input, as well as the overlap
parameter o. Processor 1 finds common patterns in the two sub-sequences $s^1_1 \ldots s^1_{N/2+o}$
and $s^2_1 \ldots s^2_{N/2+o}$, while processor 2 works on $s^1_{N/2-o} \ldots s^1_N$ and $s^2_{N/2-o} \ldots s^2_N$.



6.2 Breaking the Sequences

This approach has been considered for the common substrings problem. In DNA
sequences that share the same biological function, common substrings usually
appear in the same order in each sequence. Therefore, if a common pattern
occurs at the beginning of a sequence, we expect it to occur at the beginning
of every other sequence. The idea is thus the following: each processor finds
common substrings among corresponding regions of the sequences. For example,
suppose that two strings $S_1 = s^1_1 \ldots s^1_N$ and $S_2 = s^2_1 \ldots s^2_N$ have been given as
input. If the algorithm runs on two processors, we make processor 1 work on the
first halves of the two strings, that is $s^1_1 \ldots s^1_{N/2}$ and $s^2_1 \ldots s^2_{N/2}$, while
processor 2 will search for common patterns in $s^1_{N/2+1} \ldots s^1_N$ and $s^2_{N/2+1} \ldots s^2_N$.

To avoid missing “interesting” patterns that occur near the middle of the 
sequences, each half is extended by a region that overlaps with the other, whose 
size can be given as input to the algorithm. An example is shown in Fig. 3. 
This idea can be easily extended to more than two processors. This approach 
is appealing when the strings to be processed are very long, for example whole 
chromosomes from different organisms. Notice that no communication among 
processors is needed: each one works by itself, and outputs its list of patterns. 
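The splitting with overlap can be sketched as follows. The function name and the rounding of region boundaries are assumptions of this example; it simply cuts every sequence into equally sized regions and widens each region by the overlap on both sides.

```python
def split_with_overlap(sequences, n_procs, overlap):
    """For each processor, return the list of regions (one per sequence) to analyse."""
    regions = []
    for proc in range(n_procs):
        chunk = []
        for seq in sequences:
            size = len(seq)
            lo = max(0, proc * size // n_procs - overlap)
            hi = min(size, (proc + 1) * size // n_procs + overlap)
            chunk.append(seq[lo:hi])
        regions.append(chunk)
    return regions


parts = split_with_overlap(["abcdefghij", "0123456789"], n_procs=2, overlap=2)
print(parts[0])  # ['abcdefg', '0123456']
print(parts[1])  # ['defghij', '3456789']
```

Each processor can then build the generalized suffix tree of its own regions and run the search independently of the others.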

7 Conclusions 

We presented an algorithm that finds repeated patterns in a string or common
substrings in a set of strings, suitable for the analysis of DNA and RNA
sequences. Pattern occurrences can be approximate, that is, can present a number
of mismatches that depends on the pattern size. The set of patterns that satisfy
the input constraints is output by the algorithm sorted according to different
significance measures. A parallel version of the algorithm has been easily
implemented on a cluster of workstations. Furthermore, we hinted at some ways
to speed up the execution of the algorithm. We think that the algorithm,
implemented even on a small number of computers, can provide the biological
community with a useful tool for sequence analysis.

Abstract. The use of M independent computational processors, with random
samples distributed among them, decreases the cost of the Monte
Carlo method by a factor of M, since the cost of the final summation and
averaging of the results is practically negligible. This approach is especially
effective when using the ‘double randomization’ method for solving problems
with random parameters. When M is large, the necessary amount of
random numbers is also very large, and it is especially expedient to use
the combined random-pseudorandom sequence. For global estimation of a
solution in the metric C by simulation of series of trajectories from
different points, it is reasonable to use the same random numbers for each
point. This decreases the necessary amount of random numbers.



1 Introduction 

It is obvious that distributing random samples among M independent computational
processors decreases the cost of the Monte Carlo method by a factor of M, since
the cost of the final summation and averaging of the results is
practically negligible. Different processors may realize samples of different
sizes, but in that case it is expedient to use the optimal averaging formula:

$$\bar{\zeta} = \frac{\sum_{i=1}^{M} n_i \bar{\zeta}_i}{\sum_{i=1}^{M} n_i},$$

where $n_i$ is the sample size for the i-th processor and $\bar{\zeta}_i$ is the corresponding
mean.
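The averaging of partial results with weights proportional to the sample sizes can be illustrated in Python as follows. The example is a sketch: the three lists stand in for samples produced by three processors, and the function name is chosen for the illustration.

```python
import random


def pooled_mean(sample_sizes, means):
    """Average of per-processor means, weighted by the sample sizes n_i."""
    total = sum(sample_sizes)
    if total == 0:
        raise ValueError("at least one sample is required")
    return sum(n * m for n, m in zip(sample_sizes, means)) / total


rng = random.Random(7)
sizes = [1000, 250, 4000]  # unequal workloads on three processors
samples = [[rng.random() for _ in range(n)] for n in sizes]
means = [sum(s) / len(s) for s in samples]

combined = pooled_mean(sizes, means)
overall = sum(sum(s) for s in samples) / sum(sizes)
print(abs(combined - overall) < 1e-12)
```

Only one pair (sample size, mean) has to be communicated by each processor, which is why the final averaging contributes almost nothing to the total cost.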

The massive distribution of random samples is extremely effective for the
Monte Carlo solution of problems with random parameters because ‘double
randomization’ (see Sect. 4) substantially increases the dimension of the
probability space.

When M is large, the necessary amount of random numbers is also very 
large, and it is especially expedient to use the combined random-pseudorandom 
sequence considered in Section 2. 

This work is mainly related to the solution of linear and weakly nonlinear integral
and differential equations by simulation of suitable random trajectories [1-3].
Note that for global estimation of a solution in the metric C by simulation of series
of trajectories from different points, it is reasonable to use the same random
numbers for each point (see Section 3). This decreases the necessary amount
of random numbers.

To conclude this section, note that there is no ideal parallel algorithm
for simulating a stochastic ensemble of N interacting particles. But usually here
the asymptotic deterministic error is equal to $C_1 N^{-1}$ and the corresponding
probabilistic error is equal to $C_2 N^{-1/2}$. Therefore it is expedient to carry out this
simulation independently $(C_2/C_1)^2 N$ times on different processors with final
averaging.

2 Specific Simulation of Random Numbers

As a rule, simulation of a random variable with a given distribution is carried
out by transforming one or several independent values of a random
number $\alpha$ uniformly distributed in the interval (0,1).

The sequence of ‘sample’ values of $\alpha$ is usually obtained on a computer by
number-theoretic algorithms, of which the most widely used is the so-called
‘method of residues’, in the form

$$u_0 = 1, \quad u_n \equiv u_{n-1} M \pmod{2^r}, \quad \alpha_n = u_n \cdot 2^{-r}.$$

Here r is the number of binary digits in the mantissa of the computer. Often
$M = 5^{2p+1}$ is used [1-3], where

$$p = \max\{q : 5^{2q+1} < 2^r\}.$$

Numbers of this type are called ‘pseudo-random numbers’; they are verified
by statistical testing and by solving typical problems (see [1-3]). The length
of the period of the above version of the method of residues is $2^{r-2}$. Physical
generators, tables of random numbers and quasi-random numbers are also used
in the Monte Carlo method.

The following special order of using pseudo-random numbers is expedient
for correlating different computations. It is related to conventional methods of
verifying multidimensional uniformity. The sequence $\{u_n\}$ is
divided into subsequences of length m, beginning with the numbers
$u_{km}$, $k = 0, 1, 2, \ldots$, and each subsequence is used to construct the corresponding
random trajectory. Clearly,

$$u_{(k+1)m} \equiv u_{km} M^m \pmod{2^r}.$$

So, to simulate the k-th trajectory we use the multiplicative pseudo-random
sequence beginning with

$$\alpha_{km} = u_{km} \cdot 2^{-r}.$$

Here, it is sensible to use ‘real’ random numbers instead of $\alpha_{km}$. This
combined method has a theoretical basis provided $M \to \infty$ [4] (see also [3]).

Real random numbers can be produced by physical generators. It is possible
to improve their distribution by summation modulo one (congruent
summation), i.e., using the expression

$$\alpha = d_1 + d_2 + \cdots + d_n \pmod 1,$$

where $\{d_i\}$ are numbers from a physical generator. It is known that the
distribution of $\alpha$ converges very quickly to the uniform distribution as n increases.
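The effect of congruent summation can be observed with a small Python experiment. Everything in it is an assumption made for illustration: the "physical generator" is imitated by squaring uniform numbers, which gives a strongly non-uniform distribution, and uniformity is measured crudely by the largest deviation of a ten-bin histogram from 1/10.

```python
import math
import random


def congruent_sum(values):
    """Sum modulo one: the fractional part of d_1 + ... + d_n."""
    return math.fsum(values) % 1.0


def max_bin_deviation(samples, bins=10):
    """Largest gap between an observed bin frequency and the uniform value 1/bins."""
    counts = [0] * bins
    for x in samples:
        counts[min(int(x * bins), bins - 1)] += 1
    return max(abs(c / len(samples) - 1.0 / bins) for c in counts)


rng = random.Random(2001)


def biased():
    """Stand-in for an imperfect physical generator (not uniform on (0, 1))."""
    return rng.random() ** 2


raw = [biased() for _ in range(20000)]
mixed = [congruent_sum(biased() for _ in range(8)) for _ in range(20000)]
print(max_bin_deviation(raw), max_bin_deviation(mixed))
```

In this experiment the histogram of the summed values is much closer to uniform than that of the raw values, in line with the fast convergence stated above.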

Moreover, the termwise congruent summation of the random numbers produced
by independent (or weakly dependent) random generators is very efficient, as
the next statement shows.

Let $\{p_i^{(n)}(\cdot)\}$ be the probability density functions in $[0,1]^n$ corresponding to
the independent random number generators; ‘convolution’ corresponds to
‘summation’.

Theorem 1. If for $i = 1, 2, \ldots, m$, $m \ge 2$, the distribution densities $p_i^{(n)}(\cdot)$ of
independent random vectors in $[0,1]^n$ are square integrable and $p_{(m)}^{(n)}(\cdot)$ is their
congruent m-fold convolution, then

$$\| p_{(m)}^{(n)}(\cdot) - 1 \|_{L_2} \le \prod_{i=1}^{m} \| p_i^{(n)}(\cdot) - 1 \|_{L_2}.$$

Proof. The theorem is a direct corollary of Theorem A.1 from [3]. □

When simulating trajectories, these ‘expensive’ real $\alpha$ are used as initial
numbers for the method of residues with M as large as possible. It seems that
this combined method is the most promising when using many processors.

We remark additionally that for approximately 35 years the version of the
method of residues with $M = 5^{17}$ and r = 40 has been used successfully for solving
various problems of mathematical physics. The numerical results of the statistical
testing of this version are presented in [1].

Similar positive results were obtained for the above-mentioned special order
of using these numbers with m = 1024.
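The method of residues and the block-wise use of the sequence can be sketched in Python as follows. The constants follow the values quoted in this section (r = 40, multiplier $5^{17}$, block length 1024); the function names are hypothetical, and the jump to the start of the k-th block relies on the relation $u_{km} \equiv u_0 (M^m)^k \pmod{2^r}$, evaluated with Python's three-argument `pow`.

```python
R = 40               # number of binary digits r
MULT = 5 ** 17       # multiplier M = 5^(2p+1), with p = 8 for r = 40
MOD = 1 << R         # modulus 2^r
BLOCK = 1024         # subsequence length m


def residues(u0, count):
    """Yield count numbers alpha_n = u_n * 2^(-r), where u_n = u_(n-1) * M mod 2^r."""
    u = u0
    for _ in range(count):
        u = (u * MULT) % MOD
        yield u / MOD


def block_start(k, u0=1):
    """State u_(km), reached directly as u_0 * (M^m)^k mod 2^r."""
    return (u0 * pow(MULT, k * BLOCK, MOD)) % MOD


# Each trajectory (or processor) k starts its own subsequence from block_start(k).
first_of_block_3 = list(residues(block_start(3), 5))
print(first_of_block_3)
```

In the combined method described above, `block_start(k)` would be replaced by a number obtained from a physical generator, while the rest of the trajectory still uses the multiplicative recurrence.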

3 Global Estimates of Solutions 

In order to construct a global estimate of the function

$$\varphi(x) = \int g(x,y)\, P(dy)$$

in the bounded domain D, one can estimate its values at the nodes of a rectangular
grid with step h by the Monte Carlo method and then perform linear interpolation.
We denote the estimate thus obtained by $\tilde{\varphi}(x)$. Generally speaking, the function
$\tilde{\varphi}(x)$ is a random field, whose distribution depends on the sample size (i.e. the
number of realizations of the Monte Carlo estimates), on the way of constructing
the estimate, and on the step h.

The problem of the convergence of $\tilde{\varphi}$ to $\varphi$ in some metric can be solved by
considering the quantity

$$B(\tilde{\varphi}, \varphi) = E \| \tilde{\varphi}(x) - \varphi(x) \|_{L(D)},$$

where L(D) is a corresponding Banach space (see, for example, [3]). This problem
is connected with the complexity of the estimation algorithm, i.e. with the
average number of operations that ensures the inequality $B(\tilde{\varphi}, \varphi) \le \delta$.
Estimating $B(\tilde{\varphi}, \varphi)$ is simplest for the space $L_2(D)$:

$$B^2(\tilde{\varphi}, \varphi) = \left( E \left[ \int_D [\tilde{\varphi}(x) - \varphi(x)]^2 dx \right]^{1/2} \right)^2 \le \int_D E[\tilde{\varphi}(x) - \varphi(x)]^2 dx = \int_D \mathbf{D}\tilde{\varphi}(x)\, dx + \int_D [E\tilde{\varphi}(x) - \varphi(x)]^2 dx, \qquad (1)$$

where $\mathbf{D}$ denotes the variance. In particular, expression (1) confirms the
importance of the uniform minimization of the variance $\mathbf{D}\tilde{\varphi}(x)$. We assume that
the second-order derivatives of the function $\varphi(x)$ are uniformly bounded in D. Then

$$B^2(\tilde{\varphi}, \varphi) \le d/n + C_0 h^4.$$

Therefore the problem of minimizing the complexity can here be formulated
in the form:

$$S_0 = n t_0 h^{-k} \to \min_{n,h}, \qquad d/n + C_0 h^4 = \delta^2,$$

where k is the phase space dimension, n is the sample size, h is the grid step,
the meaning of d is seen from (1), and $t_0 = t \cdot mes(D)$, where t is the computational
cost of one realization of the estimate. The optimal order of the values h, n and $S_0$
is as follows:

$$h_0^* \asymp \delta^{1/2}, \quad n_0^* \asymp \delta^{-2}, \quad S_0^* \asymp \delta^{-2-k/2}.$$

In this case the estimate $B(\tilde{\varphi}, \varphi) \le \delta$ is valid and therefore $\tilde{\varphi}$ converges to $\varphi$
in the metric $L_2$.
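The orders of magnitude above can be checked with a short computation. The sketch below minimizes the cost $n t_0 h^{-k}$ under the constraint $d/n + C_0 h^4 = \delta^2$ in closed form: if a fraction a of $\delta^2$ is assigned to the statistical term d/n, the cost is proportional to $a^{-1}(1-a)^{-k/4}$, which is smallest for a = 4/(4+k). This derivation, the function name and the default constants are assumptions of the example, not part of the original text.

```python
def optimal_l2_parameters(delta, k, d=1.0, c0=1.0, t0=1.0):
    """Minimise S = n * t0 * h**(-k) subject to d/n + c0 * h**4 == delta**2.

    Returns (h, n, cost); n is treated as a real number for simplicity.
    """
    share = 4.0 / (4.0 + k)  # fraction of delta^2 spent on the statistical term d/n
    n = d / (share * delta ** 2)
    h = ((1.0 - share) * delta ** 2 / c0) ** 0.25
    cost = n * t0 * h ** (-k)
    return h, n, cost


h1, n1, s1 = optimal_l2_parameters(0.1, k=2)
h2, n2, s2 = optimal_l2_parameters(0.05, k=2)
print(h1 / h2, n2 / n1, s2 / s1)
```

Halving $\delta$ multiplies h by $2^{-1/2}$, n by 4 and the cost by $2^{2+k/2}$, in agreement with the orders $h \asymp \delta^{1/2}$, $n \asymp \delta^{-2}$ and $S \asymp \delta^{-2-k/2}$.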

One can obtain similar results for the metric C(D) by using the theorems on
embedding the space $W_2^l(D)$ in the space C(D), provided 2l > k. This condition
implies that using the first-order derivatives requires k = 1, i.e. allows one to
consider the convergence of the estimate of the solution of a one-dimensional
equation (or of a multi-dimensional equation along a given straight line). In this case
the inequality

$$\| \tilde{\varphi} - \varphi \|_C^2 \le K \left[ \int_D [\tilde{\varphi}(x) - \varphi(x)]^2 dx + \int_D [\tilde{\varphi}'(x) - \varphi'(x)]^2 dx \right]$$

holds.

Since the variance of the difference of independent random quantities is equal
to the sum of their variances, one must use the following inequality when making
an independent estimate of values at nodes of the given grid:

$$\mathbf{D}\tilde{\varphi}'(x) \le \frac{d(x)}{n h^2}.$$

In addition, since $\tilde{\varphi}'$ is a step function, one needs to consider the relation

$$\int_D [E\tilde{\varphi}'(x) - \varphi'(x)]^2 dx \le C_1 h^2.$$

Thus, it is appropriate to consider the following estimate asymptotically for
$h \to 0$:

$$E \| \tilde{\varphi} - \varphi \|_C^2 \le \frac{d_1}{n h^2} + C_1 h^2.$$

This estimate leads to the following problem of complexity minimization:

$$S_1 = n t_0 h^{-1} \to \min_{n,h}, \qquad \frac{d_1}{n h^2} + C_1 h^2 = \delta^2. \qquad (2)$$

The optimal values are of the following order of magnitude:

$$h_1^* \asymp \delta, \quad n_1^* \asymp \delta^{-4}, \quad S_1^* \asymp \delta^{-5}.$$

Thus, in the one-dimensional case the complexity of global estimation of the
solution in C(D) is quadratic with respect to that of the estimate in $L_2(D)$.

The complexity of the estimation in C(D) can be considerably reduced by
using a dependent estimate of the values of $\tilde{\varphi}$ that provides the relation

$$|\tilde{\varphi}(x) - \tilde{\varphi}(x + h)| \le C h, \quad h \to 0,$$

with probability 1. In this case, instead of (2), we obtain the problem

$$S_2 = n t_0 h^{-1} \to \min_{n,h}, \qquad d_2/n + C_2 h^2 = \delta^2,$$

where

$$h_2^* \asymp \delta, \quad n_2^* \asymp \delta^{-2}, \quad S_2^* \asymp \delta^{-3}.$$

There are various ways of correlating the estimates in the Monte Carlo
method (see Section 1 and [1-3]).
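One simple way of obtaining such a dependent estimate is to use the same random numbers at every grid node. The Python sketch below illustrates this for the hypothetical integrand g(x, y) = sin(x + y) with y uniform on (0, 1); the integrand, the grid and the function names are assumptions of the example. Since this g changes by at most |x - x'| when x is replaced by x', the estimates at neighbouring nodes computed from a common sample differ by at most h.

```python
import math
import random


def estimate_on_grid(g, nodes, n, rng, common=True):
    """Monte Carlo estimates of phi(x) = E g(x, Y), Y uniform on (0, 1), at each node."""
    if common:
        ys = [rng.random() for _ in range(n)]  # one sample shared by all nodes
        return [sum(g(x, y) for y in ys) / n for x in nodes]
    # Independent estimate: a fresh sample for every node.
    return [sum(g(x, rng.random()) for _ in range(n)) / n for x in nodes]


def max_jump(values):
    """Largest absolute difference between estimates at neighbouring nodes."""
    return max(abs(a - b) for a, b in zip(values, values[1:]))


def g(x, y):
    return math.sin(x + y)  # |g(x, y) - g(x', y)| <= |x - x'|


h = 0.01
nodes = [i * h for i in range(101)]
dependent = estimate_on_grid(g, nodes, 100, random.Random(1), common=True)
independent = estimate_on_grid(g, nodes, 100, random.Random(1), common=False)
print(max_jump(dependent), max_jump(independent))
```

The dependent estimate also needs n random numbers in total instead of n per node, which is the saving in random numbers mentioned in the introduction.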

4 Double Randomization 

1. Various examples introducing additional randomness for constructing effec- 
tive simulation algorithms can be found in the literature devoted to the Monte 
Carlo methods (see, for example, [3]). This section is concerned with randomized 
algorithms for estimating probabilistic characteristics of equations with random 
parameters. 

Randomized estimation of the statistical moments of the solution is presented 
below. Assume that the functional equation $L\varphi = f$ is to be solved by the 
Monte Carlo method on the basis of simulation of a stochastic process. (Denote 
the trajectories of this process by $\omega$.) This means that random variables 
$\xi_k(\omega)$ are constructed so that 

$$\mathbf{M}\,\xi_k(\omega) = J_k, \quad k = 1, 2, \ldots, m,$$

where $J_k$ are the functionals of $\varphi$ to be evaluated ($\mathbf{M}$ denotes the 
mathematical expectation). 

Let the operator $L$ and the function $f$ depend on a random field $\sigma$ (for 
example, a random medium in transfer theory, a random force in elasticity 
theory, etc.). Also, 

$$J_k = J_k(\sigma)$$

and 

$$\mathbf{M}[\xi_k(\omega,\sigma) \mid \sigma] = J_k(\sigma),$$

where the variables $\omega$ and $\sigma$ are generally not independent. 

Consider the problem of evaluating the quantities 

$$\bar J_k = \mathbf{E} J_k(\sigma), \quad R_{kj} = \mathbf{E}[J_k(\sigma) J_j(\sigma)], \quad k, j = 1, \ldots, m,$$

where $\mathbf{E}$ denotes the mathematical expectation with respect to the 
distribution of $\sigma$. 

The following obvious method is known for evaluating these mathematical 
expectations. First, realizations of $\sigma$ are constructed; then the equation 
$L\varphi = f$ is solved precisely enough for each realization by a numerical or an 
analytical technique. Finally, statistical estimates of the desired quantities 
are calculated. However, this approach fails for complicated multi-dimensional 
problems because the computational cost of an explicit solution of the equation 
considered is too high. Therefore, it is sometimes useful to apply a method of 
‘double randomization’. In our case, this technique follows from the following 
relations: 

$$\mathbf{E} J_k(\sigma) = \mathbf{E}\,\mathbf{M}\,\xi_k(\omega,\sigma) = \mathbf{M}_{(\omega,\sigma)}\,\xi_k(\omega,\sigma),$$
$$R_{kj} = \mathbf{E}[\mathbf{M}\xi_k(\omega_1,\sigma)\,\mathbf{M}\xi_j(\omega_2,\sigma)] = \mathbf{M}_{(\omega_1,\omega_2,\sigma)}[\xi_k(\omega_1,\sigma)\,\xi_j(\omega_2,\sigma)], \tag{3}$$

where $\omega_1$ and $\omega_2$ are conditionally independent trajectories constructed for 
one fixed realization of $\sigma$, and the subscript of the expectation symbol 
indicates the distribution to which it corresponds. Clearly, we have to assume 
the existence of the total expectations in (3), i.e., 

$$\mathbf{M}_{(\omega,\sigma)}|\xi_k(\omega,\sigma)| < +\infty, \quad \mathbf{M}_{(\omega_1,\omega_2,\sigma)}|\xi_k(\omega_1,\sigma)\,\xi_j(\omega_2,\sigma)| < +\infty.$$

Relations (3) show that to estimate the quantities $\bar J_k$ it is sufficient to 
construct only one trajectory for a fixed $\sigma$, while the estimation of the 
quantities $R_{kj}$ requires two conditionally independent trajectories. To 
optimize the randomization technique, it is natural to use the ‘splitting 
method’ (see, for example, [5]). In this method, the quantities $\bar J_k$ are 
estimated as follows. First, one constructs $n$ conditionally independent 
trajectories (i.e. a vector $\omega = (\omega_1, \ldots, \omega_n)$, with $\sigma$ fixed), and then a 
random variable 

$$\frac{1}{n}\sum_{i=1}^{n} \xi_k(\omega_i,\sigma)$$

is used instead of $\xi_k(\omega,\sigma)$. The optimal value of $n$ is calculated by the formula 
(see, for example, [5]) 

$$n = \sqrt{\frac{a_2 t_1}{a_1 t_2}},$$

where 

$$a_1 = \mathbf{E}(\mathbf{M}\xi_k)^2 - \bar J_k^2, \quad a_2 = \mathbf{E}\,\mathbf{M}\xi_k^2 - \mathbf{E}(\mathbf{M}\xi_k)^2,$$

$t_1$ is the average computing time for a fixed realization of $\sigma$, and $t_2$ is the 
average computing time for a fixed realization of $\omega$. 
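The double randomization idea can be illustrated by a small Python sketch. It is only an illustration under stated assumptions: the random parameter, the toy estimator `xi` (for which the conditional mean is simply sigma), and all function names are hypothetical and not taken from the text. The helper `optimal_split` assumes that the cost of one realization is t1 + n*t2 and that the variance of the split estimate is a1 + a2/n.

```python
import math
import random


def sample_sigma(rng):
    """Hypothetical random parameter sigma (stands in for the random field)."""
    return rng.gauss(1.0, 0.5)


def xi(sigma, rng):
    """Toy unbiased estimate of J(sigma) = sigma built from one 'trajectory'."""
    return sigma + rng.gauss(0.0, 1.0)


def double_randomization(n_samples, seed=0):
    """Estimate E J(sigma) and E J(sigma)^2 without computing J(sigma) exactly."""
    rng = random.Random(seed)
    sum_j = 0.0
    sum_r = 0.0
    for _ in range(n_samples):
        sigma = sample_sigma(rng)
        x1 = xi(sigma, rng)  # first trajectory for this sigma
        x2 = xi(sigma, rng)  # second, conditionally independent trajectory
        sum_j += x1          # one trajectory suffices for the first moment
        sum_r += x1 * x2     # two trajectories are needed for the second moment
    return sum_j / n_samples, sum_r / n_samples


def optimal_split(a1, a2, t1, t2):
    """Value of n minimizing (t1 + n*t2) * (a1 + a2/n), an assumed cost model."""
    return math.sqrt(a2 * t1 / (a1 * t2))
```

In this toy model the product `x1 * x2` is unbiased for the second moment of J, whereas `x1 * x1` would also include the variance of the trajectory noise.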

2. As a particular case of using the formulae (3) it is possible to consider 
randomization of the collision estimate [1-3] 

$$\xi = \sum_{n=0}^{N} Q_n h(x_n),$$

where 

$$Q_0 = \frac{f(x_0)}{\pi(x_0)}, \quad Q_n = Q_{n-1}\,\frac{k(x_{n-1},x_n)}{p(x_{n-1},x_n)}.$$




Here $\{x_n\}$ is the Markov chain with functional characteristics $\pi(x_0)$ and 
$p(x',x)$, and $k(x',x)$ is the kernel related to the integral equation 
$\varphi = K\varphi + f$. Let $\tilde k(x_{n-1},x_n)$, $\tilde f_0$, $\tilde h_n$ be independent unbiased 
estimates of the corresponding values $k(x_{n-1},x_n)$, $f(x_0)$, $h(x_n)$ (for 
instance, random estimates of integrals which express those values), let 
$\tilde Q_n$ be the corresponding unbiased estimate of the weight $Q_n$, and let $K_1$ 
be the integral operator with the kernel function $\mathbf{E}|\tilde k(x',x)|$. If 
$\rho(K_1) < 1$, $\mathbf{E}|\tilde h| \in L_\infty$, $\mathbf{E}|\tilde f| \in L_1$, then [1] 






_ N ^ ^ 

where = X) Qn^n- Besides in [1] it is shown, that 

71=0 



}^e = {x,h[2p* -h]) + {x,Dh), 
where x is the Neumann series for the equation 



,, , f Ek'^(x'x) Ef^(x) , N 

X {x)= / — — ^x {x )dx'+ oi" X = J<'x + Hf T^), 

I p[x',x) 7r[x) ^ 



X 



ifp(K;)<f,E/v^GLi. 

If evaluating h[x) is only randomized, then 

Ee = Ee+(A,Dh), 

where y is determined as usual [1] . It is possible to show that the relation p{Kp) < 
1 implies the relation p{K[) < 1. 

3. Further, a class of Monte Carlo algorithms for solving large scale linear 
algebraic systems with dense matrices, based on randomization of matrix-vector 
multiplication, is considered [6] (see also [3]). Varying the number of non-zero 
rows in the random sparse matrices involved from N to 1 (N being the number of 
equations), one can proceed from the deterministic successive approximations 
method to the Neumann-Ulam scheme [1] (with transition probabilities $p_{ij}$ all 
equal to 1/N), the statistical error thus being easily controlled. The general 
scheme with sufficient conditions for variance boundedness is considered. The 
method is intended for solving stochastic problems. 

Consider a system of linear algebraic equations 

Au = g, (4) 

where $u, g \in \mathbb{R}^N$ and $A$ is a nonsingular $N \times N$ matrix. Suppose that a 
stationary iterative method for solving this system can be constructed. This 
means that (4) can be transformed to 

$$u = Ku + f,$$

with the spectral radius of $K$ less than unity, and thus the successive 
approximations 

$$u^{(n+1)} = K u^{(n)} + f, \quad u^{(0)} \text{ given},$$

converge to the unique solution of (4). The results below are independent of 
whether the elements of the matrix $K$ are available in explicit form or not. 

Let $S^{(0)}, \ldots, S^{(n)}, \ldots$ be an infinite sequence of independent realizations 
of the random matrix $S$ such that $\mathbf{E}S = K$. Define the sequence of random 
vectors $\xi^{(n)}$ by setting 

$$\xi^{(n+1)} = S^{(n)} \xi^{(n)} + f, \quad \xi^{(0)} = u^{(0)}. \tag{5}$$

Since $S^{(n)}$ and $\xi^{(n)}$ are independent within this construction, we get 
$\mathbf{E}\xi^{(n)} = u^{(n)}$ for all $n$. 

Let $J = \{j_1, j_2, \ldots, j_L\}$ be a random set of $L$ different natural numbers less 
than or equal to $N$, where $j_1$ is chosen with equal probabilities among all these 
numbers, $j_2$ is chosen with equal probabilities among the remaining numbers, 
etc. For all first indices $i$ put 

$$s_{ij} = \begin{cases} \dfrac{N}{L}\,k_{ij}, & \text{if } j \in J, \\ 0 & \text{otherwise.} \end{cases}$$

Thus, the random matrix $S$ constructed has $L$ non-zero columns which are equal 
to the corresponding columns of the matrix $K$, multiplied by $N/L$. For all $i, j$ 

$$\mathbf{E}s_{ij} = \frac{N}{L}\,k_{ij} \cdot P(j \in J) = k_{ij}.$$

Due to the special form of the random matrix $S$, to calculate the random vector 
$\xi^{(n+1)}$ (see (5)) one needs only $L$ components of $\xi^{(n)}$ to be calculated on the 
previous step with, in turn, only $L$ components of $\xi^{(n-1)}$ involved. Hence, the 
amount of computational work needed to calculate $k$ components of $\xi^{(n)}$ is 
proportional to $kL + nL^2$. If $L = 1$ then the algorithm described coincides with 
the standard Neumann-Ulam scheme (or ‘collision estimate’) with transition 
probabilities all equal to $1/N$. If $L = N$ it becomes the successive approximation 
method with no randomization at all (the vectors $\xi^{(n)}$ are equal to $u^{(n)}$ for all 
$n$). Note that the conditions of the boundedness of the variances of $\xi^{(n)}$ are 
considered in [6] (see also [3]). 
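The randomized matrix-vector iteration described above can be sketched in Python as follows. This is a minimal illustration, not the implementation of [6]: the function names are hypothetical, the matrix is stored densely as a list of rows, and only the column-sampling scheme with the N/L scaling comes from the text.

```python
import random


def randomized_step(K, xi, f, L, rng):
    """One step xi_next = S xi + f, where S keeps L random columns of K scaled by N/L."""
    N = len(K)
    cols = rng.sample(range(N), L)  # random set J of L distinct column indices
    scale = N / L
    return [scale * sum(K[i][j] * xi[j] for j in cols) + f[i] for i in range(N)]


def randomized_iteration(K, f, u0, L, steps, rng):
    """Run the randomized recursion for a given number of steps."""
    xi = list(u0)
    for _ in range(steps):
        xi = randomized_step(K, xi, f, L, rng)
    return xi


def successive_approximations(K, f, u0, steps):
    """Deterministic iteration u_next = K u + f, for comparison."""
    N = len(K)
    u = list(u0)
    for _ in range(steps):
        u = [sum(K[i][j] * u[j] for j in range(N)) + f[i] for i in range(N)]
    return u
```

With L equal to N every column is kept and the scaling factor is one, so the sketch reproduces the deterministic iteration; with L = 1 a single column is sampled per step, and averaging many independent runs approximates the deterministic iterate.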

5 Internal Parallelization 

1. It is clear that the use of a simple parallel summator can essentially raise 
the efficiency of the ‘global walk-on-grid’ method in solving difference equations. 
At the same time such a device, supplemented with a purely arithmetic parallel 
processor computing the values of the kernel, can also essentially reduce the 
cost of the general method of local estimates [1-3] in solving multi-dimensional 
integral equations; here the local estimate method becomes essentially more 
effective than the usual ‘frequency polygon’ method. The computer costs can be 
similarly decreased when solving a number of problems by using weights [1-3] and 
when numerically simulating random fields by summing independent 
realizations of the initial random functions [5]. 

2. Now consider the simulation of a particle free path in a medium with a 
piecewise-constant total cross-section [1-3]. Let a set of surfaces be determined 
so that the $i$-th surface is the boundary of only two subdomains with numbers 
$k_i$ and $l_i$, i.e., there is the correspondence $i \to (k_i, l_i)$, $i = 1, \ldots, N$. In each 
given subdomain the cross-section is constant. Let a particle start from a point 
$r$ in the $j$-th subdomain in the direction $\omega$, i.e., along the ray 

$$r(t) = r + \omega t, \quad t > 0.$$

Using standard geometric algorithms (see, for instance, [1-3]), it is possible to 
calculate the corresponding distances from the point $r$ to all the surfaces of the 
system under consideration. Obviously, a simple arithmetic parallel 
multiprocessor system is essentially useful here. As a result we obtain the 
sequences: 

$$t_1 \le t_2 \le \ldots \le t_n;$$
$$(k^{(1)}, l^{(1)}), (k^{(2)}, l^{(2)}), \ldots, (k^{(n)}, l^{(n)}),$$

where $\{t_s\}$ are the above-mentioned distances in increasing order, and 
$(k^{(s)}, l^{(s)})$ are the numbers of subdomains which are separated by the 
corresponding surfaces. Further, the sequence 

$$m_1 = j, m_2, \ldots, m_n \tag{6}$$

of numbers of intersected domains can be determined by the following recursive 
procedure: 

if $m_s = k^{(s)}$ then $m_{s+1} = l^{(s)}$ else $m_{s+1} = k^{(s)}$, $s = 1, \ldots, n-1$. 

Note that in the case $m_{s+1} = k^{(s)}$ the equality $l^{(s)} = m_s$ has to be valid if the 
initial geometrical information is true. Using the sequence (6) we can sample 
a free-path length [1-3]; here it is useful to compute beforehand, in parallel, 
all the values 

$$(t_s - t_{s-1})\,\sigma_{m_s}, \quad t_0 = 0, \quad s = 1, \ldots, n.$$
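The recursive determination of the crossed subdomains can be sketched in Python as below. This is a hedged illustration: the function names and the data layout (a list of distances sorted in increasing order, a parallel list of subdomain pairs, and a dictionary of cross-sections per subdomain) are assumptions made for the example, not part of the original description.

```python
def subdomain_sequence(start, pairs):
    """Return m_1..m_n: m_1 = start, and m_{s+1} lies on the other side of surface s."""
    m = [start]
    for k, l in pairs[:-1]:
        current = m[-1]
        if current == k:
            m.append(l)
        elif current == l:
            m.append(k)
        else:
            # The geometric input is inconsistent: surface s does not bound m_s.
            raise ValueError("surface does not bound the current subdomain")
    return m


def optical_lengths(distances, m, sigma):
    """Return (t_s - t_{s-1}) * sigma[m_s] with t_0 = 0 for every crossed subdomain."""
    values = []
    previous = 0.0
    for t, domain in zip(distances, m):
        values.append((t - previous) * sigma[domain])
        previous = t
    return values
```

Each entry of `optical_lengths` depends only on its own segment, which is why these values can be computed independently, for example one per processor.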

If the system contains a part with a regular net of cells, then it is expedient 
to use for this part the maximal cross-section method [5]. Such a combined 
simulation of a free-path length is considered in detail for the case of a 
hexagonal net in [5], where the net is reduced to the parallelepipedal form, and 
the operation ‘entier’ is used for the determination of the elementary subdomain 
numbers. Finally note that if 

$$\sigma(r) = \sum_{k=1}^{m} \sigma_k(r),$$

then the free-path length $l$ can be sampled by the formula 

$$l = \min(l_1, \ldots, l_m),$$

where $\{l_k\}$ are independent random free-path lengths corresponding to the 
cross-sections $\{\sigma_k(r)\}$. 
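The minimum formula can be checked numerically with a short Python sketch. The sketch assumes, for simplicity, cross-sections that are constant in space, so that each partial free path is exponentially distributed; the function names are hypothetical.

```python
import math
import random


def sample_exponential(sigma, rng):
    """Free-path length for a constant cross-section sigma (inverse transform)."""
    return -math.log(1.0 - rng.random()) / sigma


def sample_free_path(sigmas, rng):
    """Free path for the total cross-section sum(sigmas), as a minimum of partial paths."""
    return min(sample_exponential(s, rng) for s in sigmas)
```

Under this assumption the mean of `sample_free_path` should approach the reciprocal of the total cross-section; the partial paths are independent and could be sampled on separate processors.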

3. When simulating the ‘walk on spheres’ process it is necessary to determine the 
distance from a given point $r$ to the boundary of the region under consideration 
(see [1,3]). This distance is expressed by the formula 

$$d(r) = \min(d_1(r), \ldots, d_N(r)),$$

where $\{d_i(r)\}$ are the distances from $r$ to the elementary surfaces which are 
the parts of the boundary. It is expedient to use here a simple multiprocessor 
system. Usually the region is artificially divided into parts so that if a point 
is included in one of them, then it is necessary to compute only some 
corresponding distances $\{d_i(r)\}$. 

To simulate ‘walk on spheres’ it is necessary to sample the isotropic unit 
vector. In the $n$-dimensional case it is expedient to use for this purpose the 
following well-known relation between the Gaussian distribution and the 
isotropic direction: if $\eta_1, \ldots, \eta_n$ are standard independent Gaussian random 
variables, then the vector $\eta = (\eta_1, \ldots, \eta_n)$ is isotropic [1]. 
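A minimal Python sketch of this sampling rule follows. The explicit normalization to unit length and the function name are additions made for the example; the text itself only states that the Gaussian vector has an isotropic direction.

```python
import math
import random


def isotropic_unit_vector(n, rng):
    """Sample a unit vector with isotropic direction in n dimensions."""
    eta = [rng.gauss(0.0, 1.0) for _ in range(n)]  # independent standard Gaussians
    norm = math.sqrt(sum(x * x for x in eta))
    return [x / norm for x in eta]
```

The n Gaussian components are independent, so they can be generated in parallel before the single normalization step.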
Abstract. The four points modified explicit group (MEG) method for 
solving the 2D Poisson equation was introduced by Othman and Abdullah 
[6] and was shown to be superior to the four points EDG and EG methods 
due to Abdullah [1] and Evans et al. [4], respectively. These methods 
were found to be suitable for parallel implementation, see Evans and 
Yousif [5], Yousif and Evans [8]. In this paper, the implementation of 
the four points MEG algorithm with the red black (RB) and four colors 
(4C) strategies for solving the same equation on a shared memory 
parallel computer is presented. The experimental results for the test 
problem are included and compared with the parallel four points EG and 
EDG algorithms. 



1 Introduction 

The parallel point iterative algorithm which incorporates the full-sweep approach 
for solving a large and sparse linear system has been implemented successfully 
by Barlow and Evans [2] and Evans [3], while the half-sweep approach was 
introduced by Abdullah [1] for the derivation of the four points EDG method. 
Since the EDG method is explicit, it is suitable for parallel implementation on 
any parallel computer. The parallel EG and EDG methods have been developed 
extensively by Evans et al. [5] and Yousif et al. [8], respectively. For instance, 
Yousif and Evans [8] implemented the parallel four, six and nine points EDG 
methods for solving the 2D Poisson equation. All the parallel point and block 
iterative algorithms were implemented on the MIMD Sequent B8000 computer 
system at the Parallel Algorithm Research Center (PARC), Loughborough 
University of Technology, United Kingdom. 

In recent years, the four points MEG iterative method was derived from the 
standard five points formula with the grid spacings h and 2h, and the rotated 
five points formula. Furthermore, the method is shown to be superior to the 
four points EDG and EG methods, see Othman and Abdullah, 

2 Derivation of the Four Points Modified Explicit Group 
Method 

Many important physical phenomena, such as electromagnetic and incompressible 
potential flow fields, are described by elliptic equations. A typical 
representative is the Poisson equation, 

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = f(x,y), \quad (x,y) \in \Omega, \tag{1}$$

subject to the Dirichlet boundary conditions and satisfying the exact solution 
$u(x,y) = g(x,y)$ for $(x,y) \in \partial\Omega$, whose discretization normally results in a 
large and sparse linear system. Hence, an iterative method is considered a 
suitable approach for solving such a linear system. 

Let us consider Eq. (1) on the solution domain $\Omega$ with the grid spacing $h$ in 
both directions, $x_i = x_0 + ih$ and $y_j = y_0 + jh$, for all $i, j = 0, 1, \ldots, n$. Eq. (1) 
can be approximated at any point $(x_i, y_j)$ in many ways. The discretized form 
of Eq. (1) with the finite difference approximation results in the standard 
five points formula, 

$$v_{i+1,j} + v_{i-1,j} + v_{i,j+1} + v_{i,j-1} - 4v_{i,j} = h^2 f_{i,j}, \tag{2}$$

where $v_{i,j}$ is an approximation to the exact solution $u(x_i, y_j)$ at the grid points 
$(x_i, y_j) = (ih, jh)$ and $f_{i,j} = f(x_i, y_j)$. Eq. (1) can also be discretized using the 
same approximation formula with the grid spacing $2h$, which leads to the 
following equation, 

$$v_{i+2,j} + v_{i-2,j} + v_{i,j+2} + v_{i,j-2} - 4v_{i,j} = 4h^2 f_{i,j}. \tag{3}$$

Another type of approximation, the rotated five point approximation, can be 
obtained by rotating the $x$-$y$ axes clockwise by 45°. Thus, the rotated 
approximation for Eq. (1) becomes 

$$v_{i+1,j+1} + v_{i-1,j-1} + v_{i+1,j-1} + v_{i-1,j+1} - 4v_{i,j} = 2h^2 f_{i,j}. \tag{4}$$

Eqs. (2), (3) and (4) all have local truncation errors of order $O(h^2)$. 
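The three difference formulas can be sanity-checked on a function for which they are exact. The Python sketch below uses the hypothetical test function u = x^2 + y^2 (so that the right-hand side f equals 4), which is chosen for the example and does not come from the paper; the function names are likewise assumptions.

```python
def u(x, y):
    """Test function with u_xx + u_yy = 4 everywhere."""
    return x * x + y * y


def stencil_residuals(x, y, h, f=4.0):
    """Residuals of the standard (h), standard (2h) and rotated five points formulas."""
    standard = (u(x + h, y) + u(x - h, y) + u(x, y + h) + u(x, y - h)
                - 4.0 * u(x, y) - h * h * f)
    double = (u(x + 2 * h, y) + u(x - 2 * h, y) + u(x, y + 2 * h) + u(x, y - 2 * h)
              - 4.0 * u(x, y) - 4.0 * h * h * f)
    rotated = (u(x + h, y + h) + u(x - h, y - h) + u(x + h, y - h) + u(x - h, y + h)
               - 4.0 * u(x, y) - 2.0 * h * h * f)
    return standard, double, rotated
```

For a quadratic test function all three residuals vanish up to rounding, which confirms the factors 1, 4 and 2 multiplying the right-hand side in the three formulas.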

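From Figure 1, the solution at any group of four points of type • in the solution 
domain can be solved using Eq. (3), and this will result in a (4 x 4) system of 
equations, 

$$\begin{bmatrix} 4 & -1 & 0 & -1 \\ -1 & 4 & -1 & 0 \\ 0 & -1 & 4 & -1 \\ -1 & 0 & -1 & 4 \end{bmatrix} \begin{bmatrix} v_{i,j} \\ v_{i+2,j} \\ v_{i+2,j+2} \\ v_{i,j+2} \end{bmatrix} = \begin{bmatrix} v_{i-2,j} + v_{i,j-2} - 4h^2 f_{i,j} \\ v_{i+4,j} + v_{i+2,j-2} - 4h^2 f_{i+2,j} \\ v_{i+4,j+2} + v_{i+2,j+4} - 4h^2 f_{i+2,j+2} \\ v_{i-2,j+2} + v_{i,j+4} - 4h^2 f_{i,j+2} \end{bmatrix}. \tag{5}$$

Eq. (5) can be inverted and leads to the four points MEG equation, 

$$\begin{bmatrix} v_{i,j} \\ v_{i+2,j} \\ v_{i+2,j+2} \\ v_{i,j+2} \end{bmatrix} = \frac{1}{24} \begin{bmatrix} 7 & 2 & 1 & 2 \\ 2 & 7 & 2 & 1 \\ 1 & 2 & 7 & 2 \\ 2 & 1 & 2 & 7 \end{bmatrix} \begin{bmatrix} v_{i-2,j} + v_{i,j-2} - 4h^2 f_{i,j} \\ v_{i+4,j} + v_{i+2,j-2} - 4h^2 f_{i+2,j} \\ v_{i+4,j+2} + v_{i+2,j+4} - 4h^2 f_{i+2,j+2} \\ v_{i-2,j+2} + v_{i,j+4} - 4h^2 f_{i,j+2} \end{bmatrix}, \tag{6}$$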
Fig. 1. The solution domain Ω of the four points MEG iterative method. 



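whose individual explicit equations are given by 

$$\begin{aligned} v_{i,j} &= \tfrac{1}{24}(7L_1 + W_2 + L_3), \\ v_{i+2,j} &= \tfrac{1}{24}(W_1 + 7L_2 + L_4), \\ v_{i+2,j+2} &= \tfrac{1}{24}(L_1 + W_2 + 7L_3), \\ v_{i,j+2} &= \tfrac{1}{24}(W_1 + L_2 + 7L_4), \end{aligned} \tag{7}$$

where 

$$L_1 = v_{i-2,j} + v_{i,j-2} - 4h^2 f_{i,j}, \quad L_2 = v_{i+4,j} + v_{i+2,j-2} - 4h^2 f_{i+2,j},$$
$$L_3 = v_{i+4,j+2} + v_{i+2,j+4} - 4h^2 f_{i+2,j+2}, \quad L_4 = v_{i-2,j+2} + v_{i,j+4} - 4h^2 f_{i,j+2},$$
$$W_1 = 2(L_1 + L_3), \quad W_2 = 2(L_2 + L_4).$$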
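The inversion of the (4 x 4) group system can be verified with exact rational arithmetic. The Python sketch below is a check written for this purpose; the helper names are hypothetical, and the coefficient matrices are the ones that follow from the 2h-spacing formula applied to the four group points.

```python
from fractions import Fraction

# Coefficient matrix of the four points group system.
A = [[4, -1, 0, -1],
     [-1, 4, -1, 0],
     [0, -1, 4, -1],
     [-1, 0, -1, 4]]

# Claimed inverse: (1/24) times a circulant matrix with first row 7, 2, 1, 2.
B = [[Fraction(c, 24) for c in row]
     for row in [[7, 2, 1, 2], [2, 7, 2, 1], [1, 2, 7, 2], [2, 1, 2, 7]]]


def matmul(X, Y):
    """Plain matrix product of two square lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def meg_group(L1, L2, L3, L4):
    """Explicit group solution written with the shared sums W1 and W2."""
    W1 = 2 * (L1 + L3)
    W2 = 2 * (L2 + L4)
    return ((7 * L1 + W2 + L3) / 24, (W1 + 7 * L2 + L4) / 24,
            (L1 + W2 + 7 * L3) / 24, (W1 + L2 + 7 * L4) / 24)
```

Introducing W1 and W2 only saves arithmetic: each of them is shared by two of the four explicit equations.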

Because the groups are independent and the number of mesh points is large, as 
noted by Othman and Abdullah [6], execution time can theoretically be saved if 
the iteration over the solution domain is carried out only on the points of 
type •, which are approximately a quarter of all mesh points. After the 
convergence criterion is satisfied, the solutions at the remaining mesh points 
are evaluated directly, once, starting from points of type □ followed by points 
of type o, using Eqs. (4) and (2), respectively. Hence, we can define the four 
points MEG iterative method as the following algorithm, 

1. Group all the • points into four points groups such that the iterative 
evaluations will only involve points within the group, as shown in Figure 1. 

2. Iterate the intermediate solutions of the points within the group on the 
solution domain using the following equation, 

$$\begin{bmatrix} \hat v_{i,j} \\ \hat v_{i+2,j} \\ \hat v_{i+2,j+2} \\ \hat v_{i,j+2} \end{bmatrix}^{(k+1)} = \frac{1}{24} \begin{bmatrix} 7L_1 + W_2 + L_3 \\ W_1 + 7L_2 + L_4 \\ L_1 + W_2 + 7L_3 \\ W_1 + L_2 + 7L_4 \end{bmatrix},$$

where $L_1$, $L_2$, $L_3$, $L_4$, $W_1$ and $W_2$ are described in Eq. (7). 

3. Implement the relaxation procedure, 

$$\begin{bmatrix} v_{i,j} \\ v_{i+2,j} \\ v_{i+2,j+2} \\ v_{i,j+2} \end{bmatrix}^{(k+1)} = \omega \begin{bmatrix} \hat v_{i,j} \\ \hat v_{i+2,j} \\ \hat v_{i+2,j+2} \\ \hat v_{i,j+2} \end{bmatrix}^{(k+1)} + (1 - \omega) \begin{bmatrix} v_{i,j} \\ v_{i+2,j} \\ v_{i+2,j+2} \\ v_{i,j+2} \end{bmatrix}^{(k)},$$

where $\omega$ is the relaxation factor. 

4. Check the convergence. If converged, evaluate the solution at the remaining 
points (i.e. □ followed by o) using 

4.1. $v_{i,j} = \frac{1}{4}(v_{i+1,j+1} + v_{i-1,j-1} + v_{i+1,j-1} + v_{i-1,j+1} - 2h^2 f_{i,j})$ and 

4.2. $v_{i,j} = \frac{1}{4}(v_{i+1,j} + v_{i-1,j} + v_{i,j+1} + v_{i,j-1} - h^2 f_{i,j})$, 

respectively. Otherwise, repeat the iteration cycle (i.e. go to step 2). 

5. Stop. 
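Steps 2 and 3 of the algorithm can be sketched for a single group in Python. This is an illustrative sketch, not the authors' code: the array layout `v[i][j]`, the in-place update and the function name are assumptions, and the group formulas are those obtained by inverting the (4 x 4) group system.

```python
def meg_group_update(v, f, i, j, h, omega):
    """Relaxed update of the group (i,j), (i+2,j), (i+2,j+2), (i,j+2), in place."""
    c = 4.0 * h * h
    # Right-hand sides use only points outside the group.
    L1 = v[i - 2][j] + v[i][j - 2] - c * f[i][j]
    L2 = v[i + 4][j] + v[i + 2][j - 2] - c * f[i + 2][j]
    L3 = v[i + 4][j + 2] + v[i + 2][j + 4] - c * f[i + 2][j + 2]
    L4 = v[i - 2][j + 2] + v[i][j + 4] - c * f[i][j + 2]
    W1 = 2.0 * (L1 + L3)
    W2 = 2.0 * (L2 + L4)
    # Step 2: intermediate group solution.
    new = {
        (i, j): (7.0 * L1 + W2 + L3) / 24.0,
        (i + 2, j): (W1 + 7.0 * L2 + L4) / 24.0,
        (i + 2, j + 2): (L1 + W2 + 7.0 * L3) / 24.0,
        (i, j + 2): (W1 + L2 + 7.0 * L4) / 24.0,
    }
    # Step 3: relaxation with factor omega.
    for (p, q), value in new.items():
        v[p][q] = omega * value + (1.0 - omega) * v[p][q]
```

Because the right-hand sides of one group do not involve its own four points, groups whose neighbours are not being updated at the same time can be processed concurrently.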



3 Parallel Strategies and Implementation 

Since all groups of four points in the solution domain are identical, the data 
partitioning approach is suitable for the implementation of the method, and all 
the identical tasks (i.e. groups) can be executed in parallel. Again, static 
scheduling is employed in this implementation. 

Several strategies for parallelizing the four points MEG method have been 
investigated, and only two of them produce very good results. They are 
described as follows. 



3.1 Red Black (RB) Strategy 

From Figure 2a, all the groups $T_i$, $i = 1, 2, \ldots$ (sixteen groups for $n = 18$), 
are allocated to the available processors in the RB ordering strategy. The 
groups are then iterated in the following order, 

$$T_1, T_2, T_3, T_4, T_5, T_6, T_7, T_8 \;//\; T_9, T_{10}, T_{11}, T_{12}, T_{13}, T_{14}, T_{15}, T_{16} \;//,$$

where // indicates that a synchronization point takes place. In this strategy, 
there are only two stages of iterative evaluations, which start with the block 
($T_1$, $T_2$, $T_3$, $T_4$, $T_5$, $T_6$, $T_7$, $T_8$), followed by the block ($T_9$, $T_{10}$, $T_{11}$, 
$T_{12}$, $T_{13}$, $T_{14}$, $T_{15}$, $T_{16}$), with synchronization points at the end of each 
stage to ensure that the updated values are used in the subsequent iterations. 
Each group $T_i$ is assigned to one processor at a time. Every processor 
independently iterates on its own group of points and then checks for its own 
local convergence. If this is not achieved, its local flags are reset to zero 
and the cycle is repeated. If local convergence is achieved for all the 
processors (i.e. all local flags are set to one), then a global convergence 
test is performed. 
[Figure 2: two panels with axis ticks from 0 to 18. a: Red black. b: Four colors.] 

Fig. 2. a-b show the RB and 4C ordering strategies, respectively, for n = 18. 



If global convergence is achieved, the solutions at the remaining points in the 
solution domain are evaluated directly, once, starting from points of type □ 
followed by o, using Eqs. (4) and (2), respectively. The direct evaluations are 
also executed in parallel. Otherwise, the iteration count is increased and the 
iteration cycle is repeated. 



3.2 Four Colors (4C) Strategy 

Groups of four points $T_i$, $i = 1, 2, \ldots$, are allocated to processors in the 
4C ordering strategy as shown in Figure 2b. Each group $T_i$ is iterated in the 
following order, 

$$T_1, T_2, T_3, T_4 \;//\; T_5, T_6, T_7, T_8 \;//\; T_9, T_{10}, T_{11}, T_{12} \;//\; T_{13}, T_{14}, T_{15}, T_{16} \;//.$$

The block ($T_1$, $T_2$, $T_3$, $T_4$) is allocated first to the available processors; 
after all the calculations in the block are completed, a synchronization point 
takes place to ensure that the updated values of the points in the group are 
used in the subsequent iteration. Then the second block ($T_5$, $T_6$, $T_7$, $T_8$) is 
allocated to the available processors, followed by the third block and finally 
the fourth block. Each processor checks for local and global convergence in the 
same way as described for the RB strategy. 
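The difference between the two orderings is the number of synchronization points per iteration cycle. The following Python sketch illustrates this with a thread pool; it is a schematic illustration only, in which the stage lists for sixteen groups, the function name and the `update_group` callback are assumptions, and no claim is made about the original shared memory implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Group numbers per stage for sixteen groups (the n = 18 illustration).
RB_STAGES = [list(range(1, 9)), list(range(9, 17))]
FOUR_COLOR_STAGES = [list(range(1, 5)), list(range(5, 9)),
                     list(range(9, 13)), list(range(13, 17))]


def run_cycle(stages, update_group, workers=4):
    """Run one iteration cycle and return the number of synchronization points."""
    syncs = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for stage in stages:
            # Consuming the map waits until every group of the stage is done,
            # which plays the role of the synchronization point "//".
            list(pool.map(update_group, stage))
            syncs += 1
    return syncs
```

With these stage lists a cycle needs two synchronization points in the RB ordering and four in the 4C ordering.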



4 Experimental Results 

All the methods described above were applied to the following model problem, 
which was used by Abdullah [1], Evans and Biggins [4], Evans and Yousif [5], 
Othman and Abdullah [6], and Yousif and Evans [8]. The model is defined in the 
unit solution domain Ω and described as $u_{xx} + u_{yy} = (x^2 + y^2)e^{xy}$, 








Table 1. The iteration numbers and maximum errors of the parallel four points EG, 
EDG and MEG algorithms. 

 n    Method   Strategy   ω      Ite. no.   Max. error
 26   EG       RB         1.72    72        4.63x10“®
      EDG      HZL        1.69    69        2.46x10“^
      MEG      RB         1.51    38        2.21x10“®
      MEG      4C         1.51    38        2.21x10“®
 50   EG       RB         1.84   135        1.25x10“®
      EDG      HZL        1.83   129        6.64x10“®
      MEG      RB         1.71    72        5.28x10“®
      MEG      4C         1.71    72        5.28x10“®
 74   EG       RB         1.89   201        5.75x10“^
      EDG      HZL        1.88   198        3.03x10“®
      MEG      RB         1.79   103        2.35x10“®
      MEG      4C         1.79   103        2.35x10“®
 98   EG       RB         1.92   280        3.27x10“^
      EDG      HZL        1.91   265        1.72x10“®
      MEG      RB         1.84   139        1.32x10“®
      MEG      4C         1.84   139        1.32x10“®



subject to the Dirichlet boundary conditions and satisfying the exact solution 
$u(x,y) = e^{xy}$, $(x,y) \in \partial\Omega$. 

Throughout the experiments, a tolerance of ε = 10^^'^ in the local convergence 
test was used. The experimental values of ω were obtained within ±0.01 by 
running the program for different values of ω and choosing the one(s) that 
gave the minimum number of iterations. The experiments were carried out on 
several mesh sizes, n = 26, 50, 74 and 98. 

For comparison, the parallel four points EG and EDG algorithms are implemented 
using the RB and horizontal zebra line (HZL) strategies, respectively, see 
Evans and Yousif [5], Yousif and Evans [8]. The parallel four points MEG 
algorithm is implemented with the RB and 4C strategies as described in the 
previous section. Table 1 lists the strategy, optimum value of ω, iteration 
numbers and maximum errors for all the methods. Table 2 shows the total 
execution time, speedup and efficiency of the two strategies for the parallel 
four points MEG algorithm, whilst Table 3 shows the total execution time, 
speedup and efficiency of all the algorithms. The temporal performance of the 
parallel four points MEG algorithm with the two strategies is plotted in 
Figure 3. 



5 Summary 

The results in Table 1 show that the parallel four points MEG algorithm with 
both strategies performs well, as indicated by the 
Table 2. The total execution time, speedup and efficiency of the RB and 4C strategies 
for the parallel four points MEG algorithm. 

      No.            RB strategy                   4C strategy
 n    proc.   Time      Speedup   Eff.      Time      Speedup   Eff.
 26   1        0.4892   1.0000    1.0000     0.5712   1.0000    1.0000
      2        0.3232   1.5132    0.7566     0.3783   1.5096    0.7548
      3        0.2410   2.0297    0.6766     0.3007   1.8992    0.6331
      4        0.2290   2.1355    0.5339     0.2883   1.9810    0.4953
      5        0.2040   2.3972    0.4794     0.2786   2.0498    0.4099
 50   1        3.3535   1.0000    1.0000     3.4750   1.0000    1.0000
      2        1.9481   1.7214    0.8607     2.0335   1.7088    0.8544
      3        1.5046   2.2288    0.7429     1.5909   2.1842    0.7280
      4        1.2462   2.6909    0.6727     1.3818   2.5148    0.6287
      5        1.0259   3.2687    0.6537     1.1151   3.1162    0.6232
 74   1       10.2211   1.0000    1.0000    11.2431   1.0000    1.0000
      2        5.6746   1.8012    0.9006     6.3113   1.7814    0.8907
      3        4.5797   2.2318    0.7439     5.0533   2.2249    0.7616
      4        3.3514   3.0498    0.7625     3.7929   2.9642    0.7410
      5        2.8063   3.6421    0.7242     3.1346   3.5867    0.7173
 98   1       25.4072   1.0000    1.0000    26.0720   1.0000    1.0000
      2       13.6590   1.8601    0.9300    14.0254   1.8589    0.9295
      3        9.8203   2.5872    0.8624    10.1011   2.5811    0.8604
      4        7.6705   3.3123    0.8281     7.9514   3.2789    0.8197
      5        6.5450   3.8819    0.7764     6.7519   3.8614    0.7723



[Figure 3: temporal performance plotted against the number of processors (1 to 5) 
for 4-MEG (RB strategy) and 4-MEG (4C strategy).] 

Fig. 3. Temporal performances of the four points MEG algorithms with the RB and 
4C strategies for n = 98. 
Table 3. The total execution time, speedup and efficiency of the parallel four points 
EG, EDG and MEG algorithms. 

      No.               EG                           EDG                          MEG
 n    proc.   Time       Speedup  Eff.      Time      Speedup  Eff.      Time      Speedup  Eff.
 26   1         3.3578   1.0000   1.0000     1.6083   1.0000   1.0000     0.4892   1.0000   1.0000
      2         1.8864   1.7800   0.8900     0.9137   1.7601   0.8800     0.3232   1.5132   0.7566
      3         1.6913   2.4360   0.8120     0.7208   2.2310   0.7437     0.2410   2.0297   0.6766
      4         1.0677   3.1448   0.7862     0.5509   2.9192   0.7298     0.2290   2.1355   0.5339
      5         0.9560   3.5120   0.7024     0.4855   3.3121   0.6624     0.2040   2.3972   0.4794
 50   1        24.7121   1.0000   1.0000    11.9701   1.0000   1.0000     3.3535   1.0000   1.0000
      2        13.1447   1.8800   0.9400     6.4959   1.8427   0.9214     1.9481   1.7214   0.8607
      3         9.9030   2.4954   0.8318     5.0679   2.3619   0.7873     1.5046   2.2288   0.7429
      4         8.5260   3.1984   0.7996     3.7137   3.2232   0.8085     1.2462   2.6909   0.6727
      5         6.2920   3.9275   0.7855     3.0976   3.8642   0.7724     1.0259   3.2687   0.6537
 74   1        80.9406   1.0000   1.0000    38.0573   1.0000   1.0000    10.2211   1.0000   1.0000
      2        43.0283   1.8811   0.9401    20.4488   1.8611   0.9306     5.6746   1.8012   0.9006
      3        30.9051   2.6190   0.8730    15.1991   2.5039   0.8346     4.5797   2.2318   0.7439
      4        23.6157   3.4274   0.8568    11.8230   3.2189   0.8047     3.3514   3.0498   0.7625
      5        19.8476   4.0781   0.8156     9.7313   3.9108   0.7822     2.8063   3.6421   0.7242
 98   1       205.0506   1.0000   1.0000    97.1684   1.0000   1.0000    25.4072   1.0000   1.0000
      2       104.4336   1.9597   0.7989    49.9837   1.8944   0.9472    13.6590   1.8601   0.9300
      3        73.7539   2.7802   0.9267    37.0179   2.6249   0.8749     9.8203   2.5872   0.8624
      4        60.5976   3.3838   0.8459    28.6531   3.3912   0.8478     7.6705   3.3123   0.8281
      5        49.8785   4.1110   0.8222    24.1652   4.0210   0.8042     6.5450   3.8819   0.7764



number of iterations and maximum errors. However, the algorithm with the RB 
strategy is slightly faster in total execution time than with the 4C strategy, 
as shown in Table 2. This is also indicated by the temporal performance graph 
plotted in Figure 3. The reason is that the RB strategy requires fewer 
synchronization points per completed iterative cycle than the 4C strategy. 
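The speedup and efficiency columns follow from the execution times in the usual way. The short Python sketch below assumes the standard definitions (speedup as the one-processor time divided by the p-processor time, efficiency as speedup divided by p); the function names are hypothetical, and the sample inputs are the n = 98 times reported in Table 2.

```python
def speedup(t1, tp):
    """Ratio of the one-processor time to the p-processor time."""
    return t1 / tp


def efficiency(t1, tp, p):
    """Speedup divided by the number of processors."""
    return speedup(t1, tp) / p


# n = 98, five processors, times taken from Table 2.
rb = (speedup(25.4072, 6.5450), efficiency(25.4072, 6.5450, 5))
fc = (speedup(26.0720, 6.7519), efficiency(26.0720, 6.7519, 5))
```

Under these definitions the computed values agree with the tabulated speedups and efficiencies for both strategies to the printed precision.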

In Table 3 and Figure 4, we found that the total execution time of the parallel 
four points MEG algorithm, regardless of the number of processors, is lower 
than that of the parallel four points EG and EDG algorithms. This is because 
the number of mesh points which undergo the iterative evaluations is 
approximately a quarter of the total mesh points in the solution domain. For 
the same reason, the speedup and efficiency of the parallel MEG algorithm are 
not as good as those of the other two algorithms; they can be improved by 
increasing the number of mesh points in the solution domain. Additionally, in 
Figure 5, the temporal performance of the parallel MEG algorithm shows the 
highest values compared with the other algorithms. In other words, the parallel 
four points MEG algorithm is the most effective of the three algorithms for 
solving the 2D Poisson equation. 
[Figure 4: total execution time plotted against the number of processors.] 

Fig. 4. Total execution time versus no. of processors of the four points EG, EDG and 
MEG algorithms when n = 98. 

[Figure 5: temporal performance plotted against the number of processors.] 

Fig. 5. Temporal performances of the four points EG, EDG and MEG methods when 
n = 98. 



In summary, the parallel four points MEG algorithm with the RB strategy is the 
best of the three algorithms as the number of mesh points grows. In the future, 
the algorithm will be implemented on a network of workstations (Anderson et 
al. [7]), and the results will be reported soon. 
Abstract. This paper describes the UIcluster software tool, which 
partitions Expressed Sequence Tag (EST) sequences and other genetic 
sequences into “clusters” based on sequence similarity. Ideally, each 
cluster will contain sequences that all represent the same gene. If a 
naive approach such as an NxN comparison (N is the number of sequences 
input) is taken, the problem is only feasible for very small data sets. 
UIcluster has been developed over the course of four years to solve this 
problem efficiently and accurately for large data sets consisting of tens 
or hundreds of thousands of EST sequences. The latest version of the 
application has been parallelized using the MPI (message passing 
interface) standard. Both the computation and memory requirements of the 
program can be distributed among multiple (possibly distributed) UNIX 
processes. 



1 Introduction 

Clustering is the process of taking a set of elements and partitioning them into 
meaningful groups. In the high throughput gene sequencing activities of our 
laboratories, we generate large numbers of short sequences - Expressed Sequence 
Tags (ESTs) - and partition them into sets based on similarity. This problem is 
important in several respects, principally for creating non-redundant indices 
of genes and assessing the novelty of sequencing. If done in a naive fashion, 
such as an NxN comparison, this problem would be intractable for the data set 
sizes we produce (50K-300K ESTs). Although there are several existing software 
systems [7, 5, 1, 6] available that perform sequence clustering accurately, our 
program is unique in its ability to efficiently and accurately cluster EST 
sequences. Over the past four years, we have developed techniques to speed up 
the computation by using increasingly sophisticated heuristics along with 
parallel processing techniques. The usefulness of our program, UIcluster, has 
been demonstrated in the identification of more than 100,000 unique/novel 
clusters across three species (human, mouse, and rat). 
2 Expressed Sequence Tags (ESTs) 

From a biological perspective, ESTs are partial transcripts of genes. Specifi- 
cally, they are sequenced from cDNA (complementary DNA) clones, synthesized 
from polyA-selected whole-cell RNA. To prepare for EST sequencing, mRNA 
molecules are extracted from cells and converted into cDNA through reverse 
transcription. The cDNAs are then cloned into a vector and electroporated into 
bacteria for growth, amplification, and storage. A collection of such cDNAs is 
referred to as a library. Each cDNA library potentially contains many unique 
and previously undiscovered genes. However, significant redundancy within a 
library (multiple copies of the same mRNA) and between libraries is normal. 

High throughput EST sequencing for gene identification involves sequencing 
the 3’ end of randomly chosen cDNA clones from a cDNA library. The use of 
a poly-T primer during reverse transcription allows for the preferential creation 
of cDNAs with a poly-A tail at their 3’ ends. Thus, sequencing can start from a
known position (within the poly-A tail).

For the purposes of this paper, and from the computational perspective, an
EST is a character string made up of letters from the alphabet {A, C, T, G,
X, N}, where A, C, T, and G represent the four nucleotide bases of DNA and X
and N represent bases within repetitive (low-complexity) segments or that are of
indeterminate identity. ESTs are typically between 400 and 1000 letters, or bases,
long. Comparing pairs of ESTs and looking for similarity is the basic element
of clustering. This comparison is complex because the underlying sequencing
technology is error prone: bases can be inserted, deleted, or misread. Studies
of our EST sequences have indicated that the error rate for EST sequencing is
approximately 5% for misread errors, and 1-2% for insertion/deletion errors.

3 Uses of Clustering 

Clustering is used to assess the gene discovery rate of sequencing done from
cDNA libraries. For single library assessment, the entire set of ESTs obtained
from that library is used as an input for clustering. Clustering partitions the set
into subsets, or clusters, based on similarity. Each EST is a member of at most
one cluster. Novelty is computed as the number of clusters identified divided by
the number of sequences clustered.
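
As a small worked illustration of the novelty calculation, the sketch below divides the cluster count by the sequence count. The function name and the example figures are hypothetical and are not taken from the program or from our data.

```python
def novelty(num_clusters: int, num_sequences: int) -> float:
    """Novelty rate: clusters identified divided by sequences clustered."""
    if num_sequences <= 0:
        raise ValueError("at least one sequence must be clustered")
    return num_clusters / num_sequences


# Hypothetical example: 1,000 ESTs that fall into 400 clusters.
print(novelty(400, 1000))  # 0.4
```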

This computation is used to calculate both incremental and overall novelty 
rates (roughly corresponding to gene discovery rates) for individual cDNA li- 
braries and for EST projects as a whole. Incremental novelty calculations are 
performed daily to monitor the sequencing efforts and to determine when cDNA 
library subtractions should occur [2]. This procedure can dramatically increase 
novelty rates. However, the subtraction process is time consuming and cannot 
be performed on a continual basis. 

Figure 1 shows an example of the effectiveness of these procedures for a 
progression of four cDNA libraries, named GO, Gf, G2p, and G3. Each sharp 
increase in novelty rates corresponds to a subtraction performed on the
preceding library.

Fig. 1. Incremental library novelty



Another significant use of clustering is the generation of non-redundant gene
indices, or UniGene sets [7]. As mentioned previously, ideally each cluster will
uniquely represent a gene. Thus, the goal in constructing a UniGene set is to
bring together all of the ESTs sequenced for a given gene into a single cluster.
This information is useful for reducing redundant processing and for the
annotation of EST sequences.

4 Program Evolution 

UIcluster has evolved as our laboratory’s processing requirements have
increased. Three generations of the clustering program have been developed to
date. The first revision was developed to work well for moderately sized data
sets of ESTs. As our data sets grew, this version required more than a day's
computation time to cluster the entire set of ESTs. The main goal of the second
version of the program was improved performance for large data sets. A third,
parallelized version, providing higher performance and several additional
features, has recently been released. All revisions of UIcluster may be freely
obtained from our project web site (http://genome.uiowa.edu).

The basic clustering program flow proceeds as follows: 1) read one sequence
from the input file, 2) compare the sequence against every existing cluster, 3)
based on sequence similarity, either add it to an existing cluster or make it the
first member of a new cluster. This process is repeated until every sequence
in the input file is examined. In step 3, the EST is only added to an existing
cluster if the specified similarity criterion is met. The similarity criterion is
run-time configurable and is of the form N out of M bases. For example, 38 out
of 40 bases would mean two sequences are judged to be similar if there is at
least one window of 38 out of 40 bases in common, allowing insertion, deletion,
and mismatch errors. The speed of the program is directly affected by these
parameters. Higher error tolerance (M - N) increases program execution time
significantly, as do larger window sizes (M).
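
The three-step flow above can be sketched as a simple greedy loop. This is a minimal illustration, not the program's actual code: the function names are hypothetical, the similarity test is a simplified stand-in that counts mismatches only (the real criterion also allows insertions and deletions), and the longest member of each cluster is used as its representative.

```python
from typing import List


def has_n_of_m_window(a: str, b: str, n: int, m: int) -> bool:
    """Simplified stand-in for the N of M criterion: True if some pair of
    length-m windows of a and b agrees in at least n positions.
    Mismatches only; insertions and deletions are NOT modelled."""
    for i in range(len(a) - m + 1):
        for j in range(len(b) - m + 1):
            matches = sum(x == y for x, y in zip(a[i:i + m], b[j:j + m]))
            if matches >= n:
                return True
    return False


def cluster(seqs: List[str], n: int, m: int) -> List[List[str]]:
    """Greedy clustering: add each sequence to the first similar cluster,
    or start a new cluster with it."""
    clusters: List[List[str]] = []
    for seq in seqs:                          # step 1: read one sequence
        for members in clusters:              # step 2: compare with every cluster
            representative = max(members, key=len)
            if has_n_of_m_window(seq, representative, n, m):
                members.append(seq)           # step 3a: join an existing cluster
                break
        else:
            clusters.append([seq])            # step 3b: start a new cluster
    return clusters


print(cluster(["ACGTACGT", "ACGTACGA", "TTTTTTTT"], n=7, m=8))
# [['ACGTACGT', 'ACGTACGA'], ['TTTTTTTT']]
```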

Sequence: GCCACTTGGCGTTTTG

Hash 1   GCCACTTG = 48406
Hash 2   CCACTTGG = 44869
Hash 3   CACTTGGC = 27601
Hash 4   ACTTGGCG = 39668
Hash 5   CTTGGCGT = 59069
... etc.

Fig. 2. Example of hashing a sequence



4.1 Revision 1.0 

Revision 1.0 was useful for relatively small data sets (< 30,000 ESTs). The pro- 
gram was structured so that clusters were stored in a 2-D linked list. Each EST 
read from the input file was compared against a single representative element 
from each cluster. The longest EST from each cluster was used as a representa- 
tive element for that cluster. 

Evaluating the N of M similarity criterion for two sequences is computationally
intensive. As a performance optimization, we used a hashing technique to
eliminate comparisons that will obviously be unsuccessful (i.e., the N of M
criterion will not be met). A hash is simply an integer that uniquely represents a
short string of characters. The general equation used to generate a hash is given
by (1).

$$H = \sum_{i=0}^{\ell-1} K^{i}\,\alpha_i \qquad (1)$$

In this equation, H is the generated hash value, $\ell$ is the string length, K is
the alphabet size, and $\alpha_i$ is the integer value assigned to the letter at position
i in the string being hashed. The string length $\ell$ that can be used to generate
hashes is limited by the word size of the computer. For the DNA alphabet, each
base requires 2 bits to represent it ($\lceil \log_2 K \rceil$, where K = 4 for DNA). Thus, the
maximum value of $\ell$ using a single word on a 32-bit machine is 16.

When a sequence is hashed, equation (1) is used on every length-$\ell$ substring.
Figure 2 shows the first six hashes generated for a sample sequence with $\ell$ = 8.
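
The hash values listed in Fig. 2 can be reproduced with a few lines of code. The sketch below assumes the letter values A=0, C=1, G=2, T=3 and that the first character of each window is the least significant digit; both assumptions are inferred from the numbers in Fig. 2 rather than stated in the text. The function names are hypothetical, and the X and N letters are not handled.

```python
from typing import List

# Assumed letter values, inferred from the hash values shown in Fig. 2.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def hash_word(word: str, k: int = 4) -> int:
    """Hash of one word: sum of k**i times the value of the letter at position i."""
    return sum(CODE[ch] * k ** i for i, ch in enumerate(word))


def hashes(seq: str, length: int) -> List[int]:
    """Hashes of every window of the given length, in sequence order."""
    return [hash_word(seq[p:p + length]) for p in range(len(seq) - length + 1)]


print(hashes("GCCACTTGGCGTTTTG", 8)[:5])
# [48406, 44869, 27601, 39668, 59069]
```

A production implementation would normally update the hash incrementally as the window slides instead of recomputing it for each position; the direct form is used here only because it mirrors the equation.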

Fig. 3. Global hash table

When an EST is clustered, the N of M similarity criterion is only evaluated
for cluster representatives that contain one or more hashes in common with
the EST being clustered. The length of the hash probe used is an important
parameter that can significantly affect performance. Longer hash lengths will
result in better performance for a given similarity criterion. The length must
also be chosen carefully so that potential similarities are not missed. The
formula for calculating the maximum hash size is shown in (2). The rationale for
this equation is that for any chosen similarity criterion N of M, there is at
least one contiguous, error-free region of $\ell$ bases. Thus, the comparison of
two sequences can be accelerated by first searching for short exact matches of
length $\ell$ bases between the pair (i.e. searching for identical hashes). If such
a match is found, a more exhaustive search that permits errors can be performed.
If no length-$\ell$ hashes are identified, then the two sequences cannot possibly
contain a window of M bases with N bases in common.



$$\ell \le \left\lfloor \frac{M}{M - N + 1} \right\rfloor \qquad (2)$$

The calculation to generate the hashes for a sequence is only performed once 
since the hash lists are stored in memory. However, the hashes are accessed 
many times during the program's execution. This amortizes the computational
overhead of generating the hashes. 
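
The pigeonhole argument behind the maximum hash size can be checked by brute force for small windows. The sketch below assumes the bound has the form floor(M / (M - N + 1)) and, as a simplification, models only mismatch errors at fixed positions (no insertions or deletions). Both function names are hypothetical.

```python
from itertools import combinations


def max_hash_length(n: int, m: int) -> int:
    """Assumed form of the bound: floor(M / (M - N + 1))."""
    return m // (m - n + 1)


def shortest_guaranteed_run(n: int, m: int) -> int:
    """Brute force: over every placement of M - N mismatches in a window of M
    bases, find the smallest value taken by the longest error-free run."""
    worst = m
    for errors in combinations(range(m), m - n):
        bad = set(errors)
        longest = run = 0
        for pos in range(m):
            run = 0 if pos in bad else run + 1
            longest = max(longest, run)
        worst = min(worst, longest)
    return worst


print(max_hash_length(38, 40))  # 13
print(all(shortest_guaranteed_run(n, m) == max_hash_length(n, m)
          for m in range(1, 9) for n in range(1, m + 1)))  # True
```

Under these assumptions, a 38 out of 40 criterion guarantees an exact match of at least 13 bases, so any hash length up to 13 would not miss a qualifying window.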



4.2 Revision 2.0 

The main improvement in revision 2.0 was the implementation of the global 
hash table (GHT). As our EST data sets grew larger, the sequential nature of 
the traversal of the cluster representative linked list for every input sequence 
became a bottleneck. The GHT optimizes the program at a higher level than 
individual sequence comparisons by filtering the entire search space of cluster 
representatives into a subset of high-potential candidate targets. 

When a new sequence is clustered, a list of hashes is generated for each
$\ell$-base window of its sequence. Each hash in the list is then used as an index
into the GHT. Figure 3 shows a GHT with $4^8$ elements, corresponding to $\ell$ = 8.
Each element in the table points to a linked list of clusters that contain at
least one occurrence of the hash equal to its index. In Figure 3, there are three
clusters that contain the hash 2. If the sequence being clustered also has a hash
of 2, the touch count field of each cluster linked from the second element in
the GHT is incremented. If the touch count field of a cluster exceeds a run-time
configurable threshold, a detailed sequence comparison is performed between the
input sequence and the candidate cluster. This procedure is based on the premise
that two similar sequences will likely have many hashes in common.
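
The touch-count filtering described above can be sketched as follows. This is an illustration under stated assumptions, not the program's implementation: it uses a Python dictionary in place of a directly indexed array, Python lists in place of linked lists, assumed letter values A=0, C=1, G=2, T=3, and hypothetical class and method names.

```python
from collections import defaultdict
from typing import Dict, List

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed letter values


def hashes(seq: str, length: int) -> List[int]:
    """Hash of every window of the given length (base-4 positional value)."""
    return [sum(CODE[ch] * 4 ** i for i, ch in enumerate(seq[p:p + length]))
            for p in range(len(seq) - length + 1)]


class GlobalHashTable:
    def __init__(self, length: int, threshold: int) -> None:
        self.length = length
        self.threshold = threshold
        self.table: Dict[int, List[int]] = defaultdict(list)  # hash -> cluster ids

    def add_cluster(self, cluster_id: int, representative: str) -> None:
        """Link the cluster from every distinct hash of its representative."""
        for h in set(hashes(representative, self.length)):
            self.table[h].append(cluster_id)

    def candidates(self, seq: str) -> List[int]:
        """Clusters whose touch count exceeds the threshold for this sequence."""
        touch: Dict[int, int] = defaultdict(int)
        for h in hashes(seq, self.length):
            for cluster_id in self.table.get(h, ()):
                touch[cluster_id] += 1
        return [cid for cid, count in touch.items() if count > self.threshold]


ght = GlobalHashTable(length=4, threshold=2)
ght.add_cluster(0, "ACGTACGTAC")
ght.add_cluster(1, "TTTTGGGGCC")
print(ght.candidates("CGTACGTA"))  # [0]
```

Only the clusters returned by `candidates` would then go through the detailed N of M comparison.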

Care must be taken to adjust the touch count threshold appropriately. For
a given similarity criterion (e.g. 38 out of 40 bases) and hash length $\ell$, if the
threshold is too low the speedup due to the GHT will be small. Conversely, if
the threshold is too high, some sequence similarities will be missed.

Fig. 4. Execution time of Revision 1.0 vs. Revision 2.0

Figure 4 shows the execution time for both revisions of the clustering program 
with an input data set of 80,766 rat EST sequences. Revision 2.0 demonstrates
a 28x speedup while calculating virtually identical results. The major trade-off
of the GHT optimization is memory utilization. However, on a 2GB machine we
have been able to cluster data sets as large as 1 million ESTs. While theoretically
the first revision could handle data sets this large, the computation time required
would make it impractical.

4.3 Revision 3.0 

The latest version of the clustering program has been parallelized to split up 
the computational and memory requirements across several computers (com- 
pute nodes). The main reasons for doing this are for added performance and so 
that the program can scale to larger problem sizes without being constrained 
by the memory limitations of a single computer. The MPI (message passing 
interface) [4] communication standard has been used for inter-process commu- 
nication. 

In this mode of execution, each cluster is stored on exactly one compute node. 
A given sequence is read in from the input file and processed in parallel on each 
compute node. This results in a parallel search of the cluster space. Once each 
node has finished its search, each node’s best match is collectively communicated 
to all compute nodes. The node with the best match stores the sequence in its 
memory space. If no match is found on any of the compute nodes, the input 
sequence becomes a new cluster and is assigned to one of the compute nodes. 
Clusters are balanced evenly across the compute nodes.
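
The parallel search can be illustrated with a single-process simulation. This sketch does not use MPI: the per-node loop stands in for the concurrent local searches, and the selection of the winner stands in for the collective exchange of best matches. The scoring function, the minimum score (assumed non-negative), the use of the first cluster member as representative, and the "fewest clusters" balancing rule are all assumptions made for the example.

```python
from typing import Callable, List, Optional, Tuple

Cluster = List[str]


def cluster_on_nodes(seqs: List[str], num_nodes: int,
                     score: Callable[[str, str], int],
                     min_score: int) -> List[List[Cluster]]:
    """Simulate the distributed clustering; returns the clusters held by each node."""
    nodes: List[List[Cluster]] = [[] for _ in range(num_nodes)]
    for seq in seqs:
        # Local search on each node (these loops would run concurrently).
        local_best: List[Tuple[int, Optional[int]]] = []
        for clusters in nodes:
            best: Tuple[int, Optional[int]] = (-1, None)
            for idx, members in enumerate(clusters):
                s = score(seq, members[0])
                if s >= min_score and s > best[0]:
                    best = (s, idx)
            local_best.append(best)
        # Stand-in for the collective communication: every node learns all
        # best matches and the node holding the best one keeps the sequence.
        winner = max(range(num_nodes), key=lambda r: local_best[r][0])
        cluster_idx = local_best[winner][1]
        if cluster_idx is not None:
            nodes[winner][cluster_idx].append(seq)
        else:
            # No match anywhere: new cluster on the node with the fewest clusters.
            target = min(range(num_nodes), key=lambda r: len(nodes[r]))
            nodes[target].append([seq])
    return nodes


def identical_positions(a: str, b: str) -> int:
    """Toy similarity score: number of positions with the same letter."""
    return sum(x == y for x, y in zip(a, b))


print(cluster_on_nodes(["AAAA", "CCCC", "AAAT", "GGGG"], 2, identical_positions, 3))
# [[['AAAA', 'AAAT'], ['GGGG']], [['CCCC']]]
```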

Figure 5 illustrates the parallel speedup obtained for clustering a data set
of approximately 81,000 rat EST sequences. The three curves represent three
different runs of the program using different parameter sets. The first curve
(labeled 1) corresponds to the default parameters used in our processing pipeline.
The second curve (labeled 2) adds the extended search option. By default, an
EST is added to the first cluster it is found to be similar to and the search is
halted. This option enables the identification of all similar cluster
representatives for each EST clustered. The third curve (labeled 3) enables the
reverse complement checking option of the program.

Fig. 5. Parallel speedup

Since the implementation uses a collective communication at the end of every 
sequence clustered, the amount of computation required for each sequence is 
important. As the grain size increases, better performance should be observed 
since relatively less communication is being performed. 

The times in minutes for the single-node and 8-node runs for each case are shown
in the figure. Performance scales poorly for the first case, actually decreasing 
when using two compute nodes. This is most likely due to the computation be- 
ing unevenly distributed and the communication overhead. With more compute 
nodes, performance increases somewhat but is never greater than double that of 
the serial case. The larger grain size of the second case results in significantly 
improved speedup. The third curve scales similarly since the grain size is only 
slightly increased for this case. 



5 Conclusion 

The evolution of an EST clustering program has been discussed in this
extended abstract. Background information on the problem has been presented
along with details of two sequential implementations and a parallel
implementation. Planned extensions to UIcluster include utilizing the recently
released human genome sequence [3,8] to improve the accuracy of clustering and
to aid in the identification of alternative splice forms and intron/exon
boundaries. Other extensions planned include improved performance for long
sequences (e.g., full-length cDNA sequences), automatic cluster merging, and
tools for manual curation of clustering results by expert human operators.

References 

1. Adams M.D., Kerlavage A.R., Fleischmann R.D., Fuldner R.A., Bult C.J., Lee N.H.,
Kirkness E.F., Weinstock K.G., Gocayne J.D., White O., et al. (1995) Initial
assessment of human gene diversity and expression patterns based upon 83 million
nucleotides of cDNA sequence. Nature 377:3-17

2. Bonaldo M.F., Lennon G., Soares M.B. (1996) Normalization and subtraction: two 
approaches to facilitate gene discovery. Genome Research 6:791-806 

3. International Human Genome Sequencing Consortium (2001) Initial sequencing and 
analysis of the human genome. Nature 409:860-921 

4. Message Passing Interface Forum (1994) MPI: A message-passing interface standard.
University of Tennessee Technical Report CS-94-230

5. Miller R.T., Christoffels A.G., Gopalakrishnan C., Burke J.A., Ptitsyn A.A.,
Broveak T.R., Hide W.A. (1999) A comprehensive approach to clustering of
expressed human gene sequence: The Sequence Tag Alignment and Consensus
Knowledgebase. Genome Research 9:1143-1155

6. Parsons J.D., Brenner S., Bishop M.J. (1992) Clustering cDNA Sequences. Compu- 
tational Applications in Bioscience 8:461-466 

7. Schuler G.D. (1997) Pieces of the puzzle: expressed sequence tags and the catalog 
of human genes. Journal of Molecular Medicine 75:694-698 

8. Venter J.C., Adams M.D., Myers E.W., Li P.W., Mural R.J., Sutton G.G., et 
al. (2001) The sequence of the human genome. Science 291:1304-1351 




Protein Sequence Comparison 
on the Instruction Systolic Array 



Bertil Schmidt¹, Heiko Schröder¹ and Manfred Schimmler²

¹ School of Computer Engineering, Nanyang Technological University, Singapore 639798
{asbschmidt, asheiko}@ntu.edu.sg
² Institut für Datenverarbeitungsanlagen, TU Braunschweig,
Hans-Sommer-Str. 66, 38106 Braunschweig, Germany
masch@ida.ing.tu-bs.de



Abstract. Molecular biologists frequently compare an unknown protein 
sequence with a set of other known sequences (a database scan) to detect 
functional similarities. Even though efficient dynamic programming algorithms 
exist for the problem, the required scanning time is still very high, and because 
of the exponential database growth finding fast solutions is of highest 
importance to research in this area. In this paper we present a new approach to 
biosequence database scanning on the instruction systolic array to gain high 
performance at low cost. To derive an efficient mapping onto this architecture, 
we designed a fine-grained parallel sequence comparison algorithm. This results 
in an implementation with significant runtime savings on Systola 1024, a 
parallel computer of this particular architecture. 



1 Introduction 

Scanning protein sequence databases is a common and often repeated task in
molecular biology. The need for speeding up this treatment comes from the
exponential growth of the biosequence banks: every year their size grows by a
factor of 1.5 to 2. The scan operation consists of finding similarities between a
particular query sequence and all the sequences of a bank. This operation allows
biologists to point out sequences sharing common subsequences. From a biological
point of view, this helps to identify similar functionality.

Comparison algorithms whose complexities are quadratic with respect to the length 
of the sequences detect similarities between the query sequence and a subject 
sequence. One frequently used approach to speed up this time consuming operation is 
to introduce heuristics in the search algorithms [1]. The main drawback of this 
solution is that the more time efficient the heuristics, the worse is the quality of the 
results [15]. 

Another approach to get high quality results in a short time is to use parallel 
processing. There are two basic methods of mapping the scanning of protein sequence 
databases to a parallel processor: one is based on the systolisation of the sequence 
comparison algorithm, the other is based on the distribution of the computation of 
pairwise comparisons. Systolic arrays have proven to be a good candidate structure
for the first approach [3,8,16], while more expensive supercomputers and networks of 
workstations are suitable architectures for the second [6,13]. 

Special-purpose systolic arrays provide the best price/performance ratio by means
of running a particular algorithm [10]. Their disadvantage is the lack of flexibility
with respect to the implementation of different algorithms. Instruction systolic arrays
(ISAs) have been developed in order to combine the speed and simplicity of systolic 
arrays with flexible programmability [11]. Originally, the main application field of 
ISAs was supposed to be scientific computing. However, in the mid 90s the suitability 
of the ISA architecture for other applications was recognised, e.g. [5, 17-21]. In this 
paper we illustrate how an ISA can be used for efficient biosequence database 
scanning. We designed a parallel algorithm for sequence comparisons that is tailored 
towards the capabilities of the ISA. This leads to a high-speed implementation on 
Systola 1024, a parallel computer of this particular architecture. 

This paper is organised as follows. In Section 2, we introduce the basic sequence 
comparison algorithm for database scanning and highlight previous work in parallel 
sequence comparison. Section 3 provides a description of the ISA concept as well as 
the Systola 1024 architecture. The new parallel algorithm and its mapping onto the 
parallel architecture are explained in Section 4. The performance is evaluated and 
compared to previous implementations in Section 5. Section 6 concludes the paper 
with an outlook to further research topics. 



2 Parallel Sequence Comparison 



Surprising relationships have been discovered between protein sequences that have 
little overall similarity but in which similar subsequences can be found. In that sense, 
the identification of similar subsequences is probably the most useful and practical 
method for comparing two sequences. The Smith-Waterman algorithm [22] finds the
most similar subsequences of two sequences (the local alignment) by dynamic
programming.

The algorithm compares two sequences by computing a distance that represents the
minimal cost of transforming one segment into another. Two elementary operations
are used: substitution and insertion/deletion (also called a gap operation). Through
a series of such elementary operations, any segment can be transformed into any other
segment. The smallest number of operations required to change one segment into
another can be taken as the measure of the distance between the segments.

Consider two strings S1 and S2 of length $l_1$ and $l_2$. To identify common
subsequences, the Smith-Waterman algorithm computes the similarity H(i,j) of two
sequences ending at positions i and j of the two sequences S1 and S2. The computation
of H(i,j) is given by the following recurrences:



$$
\begin{aligned}
H(i,j) &= \max\{0,\ E(i,j),\ F(i,j),\ H(i-1,j-1) + Sbt(S1_i, S2_j)\}, & 1 \le i \le l_1,\ 1 \le j \le l_2 \\
E(i,j) &= \max\{H(i,j-1) - \alpha,\ E(i,j-1) - \beta\}, & 0 \le i \le l_1,\ 1 \le j \le l_2 \\
F(i,j) &= \max\{H(i-1,j) - \alpha,\ F(i-1,j) - \beta\}, & 1 \le i \le l_1,\ 1 \le j \le l_2
\end{aligned}
$$

where Sbt is a character substitution cost table. Initialisation of these values is
given by:

$$
\begin{aligned}
H(i,0) &= E(i,0) = 0, & 0 \le i \le l_1 \\
H(0,j) &= F(0,j) = 0, & 0 \le j \le l_2
\end{aligned}
$$

Multiple gap costs are taken into account as follows: $\alpha$ is the cost of the first
gap; $\beta$ is the cost of the following gaps. Fig. 1 illustrates an example with gap
costs $\alpha = 1$ and $\beta = 1$ and Sbt defined as:

$$
Sbt(x,y) = \begin{cases} 2 & \text{if } x = y \\ -1 & \text{otherwise} \end{cases}
$$

Each position of the matrix H is a similarity value. The two segments of S1 and S2
producing this value can be determined by a backtracking procedure (see Fig. 1).

Fig. 1. Example of the Smith-Waterman algorithm to compute the local alignment between two
DNA sequences ATCTCGTATGATG and GTCTATCAC. The matrix H(i,j) is shown for the
computation with gap costs $\alpha = 1$ and $\beta = 1$, and a substitution cost of +2 if the characters are
identical and -1 otherwise. From the highest score (+10 in the example), a traceback procedure
delivers the corresponding alignment (shaded cells), the two subsequences TCGTATGA and
TCTATCA.
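
The example of Fig. 1 can be reproduced with a direct sequential evaluation of the recurrences. The sketch below assumes the affine-gap form in which E extends a gap along j and F extends a gap along i, with a cost of alpha for the first gap and beta for each following gap, and the +2 / -1 substitution costs of Fig. 1. It computes only the highest score, not the traceback, and the function name is hypothetical.

```python
def smith_waterman_score(s1: str, s2: str, alpha: int = 1, beta: int = 1,
                         match: int = 2, mismatch: int = -1) -> int:
    """Highest local alignment score H(i,j) over all cells."""
    l1, l2 = len(s1), len(s2)
    # Row 0 and column 0 hold the initial values (all zero).
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    E = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    F = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    best = 0
    for i in range(1, l1 + 1):
        for j in range(1, l2 + 1):
            E[i][j] = max(H[i][j - 1] - alpha, E[i][j - 1] - beta)
            F[i][j] = max(H[i - 1][j] - alpha, F[i - 1][j] - beta)
            sbt = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(0, E[i][j], F[i][j], H[i - 1][j - 1] + sbt)
            best = max(best, H[i][j])
    return best


print(smith_waterman_score("ATCTCGTATGATG", "GTCTATCAC"))  # 10
```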

The dynamic programming calculation can be efficiently mapped to a linear array
of processing elements. A common mapping is to assign one processing element (PE)
to each character of the query string, and then to shift a subject sequence systolically
through the linear chain of PEs (see Fig. 2). If $l_1$ is the length of the first sequence and
$l_2$ is the length of the second, the comparison is performed in $l_1 + l_2 - 1$ steps on $l_1$ PEs,
instead of the $l_1 \times l_2$ steps required on a sequential processor. In each step the computation
for each dynamic programming cell along a single diagonal in Fig. 1 is performed in
parallel.
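
The diagonal-by-diagonal schedule can be sketched in software by visiting all cells with i + j - 1 = k in step k. This is a sequential simulation of the wavefront order only, not systolic hardware; it assumes the affine-gap recurrences with costs alpha and beta and the +2 / -1 substitution costs of Fig. 1, and the function name is hypothetical.

```python
from typing import Tuple


def wavefront_score(s1: str, s2: str, alpha: int = 1, beta: int = 1) -> Tuple[int, int]:
    """Return (highest local score, number of wavefront steps)."""
    l1, l2 = len(s1), len(s2)
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    E = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    F = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    best = steps = 0
    for k in range(1, l1 + l2):                    # step k: cells with i + j - 1 == k
        first, last = max(1, k + 1 - l2), min(l1, k)
        for i in range(first, last + 1):           # each i plays the role of one PE
            j = k + 1 - i
            # All three inputs come from diagonals k-1 and k-2, so the cells
            # of one diagonal are independent of each other.
            E[i][j] = max(H[i][j - 1] - alpha, E[i][j - 1] - beta)
            F[i][j] = max(H[i - 1][j] - alpha, F[i - 1][j] - beta)
            sbt = 2 if s1[i - 1] == s2[j - 1] else -1
            H[i][j] = max(0, E[i][j], F[i][j], H[i - 1][j - 1] + sbt)
            best = max(best, H[i][j])
        steps += 1
    return best, steps


print(wavefront_score("ATCTCGTATGATG", "GTCTATCAC"))  # (10, 21)
```

For the two sequences of Fig. 1 (lengths 13 and 9) the schedule needs 13 + 9 - 1 = 21 steps, compared with 13 × 9 = 117 cell updates performed one after another on a sequential processor.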

A number of parallel architectures have been developed for sequence analysis. In 
addition to architectures specifically designed for sequence analysis, existing 
programmable sequential and parallel architectures have been used for solving 
sequence problems. 

Special-purpose systolic arrays can provide the fastest means of running a
particular algorithm with very high PE density. However, they are limited to one
single algorithm, and thus cannot supply the flexibility necessary to run the variety of
algorithms required for analyzing DNA, RNA, and proteins. P-NAC was the first such
machine and computed edit distance over a four-character alphabet [14]. More recent
examples, better tuned to the needs of computational biology, include BioScan, BISP,
and SAMBA [3,8,16].

Fig. 2. Sequence comparison on a linear processor array: the query sequence is loaded into the 
processor array (one character per PE) and a subject sequence flows from left to right through 
the array. During each step, one elementary matrix computation is performed in each PE. 



Reconfigurable systems are based on programmable logic such as field- 
programmable gate arrays (FPGAs) or custom-designed arrays. They are generally 
slower and have far lower PE densities than special-purpose architectures. They are 
flexible, but the configuration must be changed for each algorithm, which is generally 
more complicated than writing new code for a programmable architecture. Splash-2 
and Biocellerator are based on FPGAs, while MGAP and PIM have their own 
reconfigurable designs [2,7,9,10]. 

Our approach is based on instruction systolic arrays (ISAs). ISAs combine the
speed and simplicity of systolic arrays with flexible programmability [11], i.e. they
achieve a high performance/cost ratio and can at the same time be used for a wide
range of applications, e.g. scientific computing, image processing, multimedia video
compression, computer tomography, volume visualisation and cryptography [5,17-21].
The Kestrel design presented in [4] is close to our approach since it is also a
programmable fine-grained parallel architecture. Unfortunately, its topology is purely
a linear array (compared to a mesh in our approach). So far, this has limited its
use to biosequence searches and a computational chemistry application.



3 ISA Concept and Systola 1024 

The ISA [11] is a mesh-connected processor grid, where the processors are controlled
by three streams of control information: instructions, row selectors, and column
selectors (see Fig. 3). The instructions are input in the upper left corner of the
processor array, and from there they move step by step in horizontal and vertical
direction through the array. This guarantees that within each diagonal of the array the
same instruction is active during each clock cycle. In clock cycle k+1, processors
(i+1,j) and (i,j+1) execute the instruction that has been executed by processor (i,j) in
clock cycle k.
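
The diagonal movement of the instruction stream can be visualised with a short simulation. The sketch assumes that instruction 0 enters processor (0,0) in clock cycle 0 and that one new instruction enters in every cycle; the function names are hypothetical, and selectors are not modelled.

```python
from typing import List, Optional


def instruction_at(i: int, j: int, cycle: int) -> Optional[int]:
    """Index of the instruction active in processor (i, j) in a given cycle,
    or None if no instruction has reached that processor yet."""
    k = cycle - (i + j)
    return k if k >= 0 else None


def snapshot(n: int, cycle: int) -> List[List[Optional[int]]]:
    """Instruction index held by every processor of an n x n array."""
    return [[instruction_at(i, j, cycle) for j in range(n)] for i in range(n)]


for row in snapshot(3, 2):
    print(row)
# [2, 1, 0]
# [1, 0, None]
# [0, None, None]
```

Each anti-diagonal of the printed array holds the same instruction index, which is the property stated above.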

The selectors also move systolically through the array: the row selectors
horizontally from left to right, column selectors vertically from top to bottom.
Selectors mask the execution of the instructions within the processors, i.e. an
instruction is executed if and only if both selector bits currently in that processor are
equal to one. Otherwise, a no-operation is executed.

Fig. 3. Control flow in an ISA



Every processor has read and write access to its own memory. Besides that, it has a 
designated communication register (C-register) that can also be read by the four 
neighbour processors. Within each clock phase reading access is always performed 
before writing access. Thus, two adjacent processors can exchange data within a 
single clock cycle in which both processors overwrite the contents of their own
C-register with the contents of the C-register of their neighbour. This convention avoids
read/write conflicts and also creates the possibility to perform aggregate functions 
within one instruction (or a constant number of instructions). 

Aggregate functions on a processor array are associative and commutative 
functions to which every processor provides an argument value. As they are 
commutative and associative, aggregate functions can be evaluated in many different 
ways (orders). The ISA supports top-down column operations and left-right row 
operations, due to the systolic flow of the instructions. Thus, an aggregate function 
can be implemented on the ISA by executing it firstly in all columns, placing the 
corresponding results within the last processor within each column, and secondly 
applying the aggregate function to these results in the last row, executing it within the 
last row (left-to-right). Simple examples of aggregate functions are the sum of all and 
the maximum of all. Other important operations that can be executed particularly well 
on the ISA are row broadcast (left-to-right), column broadcast (top-down) and 
ringshift operations. These are the key operations within the algorithm presented in 
this paper and hence they are explained below. 

Row broadcast: Each processor reads the value from its left neighbour. Since the 
execution of this operation is pipelined along the row, the same value is propagated 
from one communication register to the next, until it finally arrives at the rightmost 
processor. Note that the row broadcast requires only a single instruction. 

Row ringshift: The contents of the C-registers can be ringshifted along the 
processor rows by two instructions. Every two horizontally adjacent processors 
exchange data (using one read left and one read right operation). Because of the 
instruction flow from west to east this implements a ringshift. Of course, a column 
ringshift can be executed in the same way. 

Systola 1024 is a low cost add-on board for standard PCs [12]. The ISA on the 
board is integrated on a 4x4 array of processor chips. Each chip contains 64 
processors, arranged as an 8x8 square. This provides 1024 processors on the board. 
In order to exploit the computation capabilities of this unit, a cascaded memory
concept is implemented on Systola 1024 that forms a fast input and output 
environment for the parallel processing unit. For the fast data exchange with the ISA 
there are rows of intelligent memory units at the northern and western borders of the 
array called interface processors (IPs). Each IP is connected to its adjacent array 
processor for data transfer in each direction. The IPs have access to an on-board 
memory by means of special fast data channels, those at the northern interface chips 
with the northern board RAM, and those of the western chips with the western board 
RAM. The northern and the western board RAM can communicate bidirectionally 
with the PC memory over the PCI bus. The data transfer between every two memory 
units within this hierarchy is controlled by an on-board controller chip (see Fig. 4). 

At a clock frequency of f = 50 MHz and using a word format of m = 16 bits, each
(bit-serial) processor can execute $f/m = 50/16 \cdot 10^6 = 3.125 \cdot 10^6$ word operations per
second. Thus, one board with its 1024 processors performs up to 3.2 GIPS.
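
The peak-performance figure follows from a two-line calculation, spelled out here with the numbers given in the text (the variable names are chosen for the example only).

```python
clock_hz = 50_000_000      # f = 50 MHz
word_bits = 16             # m = 16 bits, processed bit-serially
processors = 1024          # processors on one board

ops_per_processor = clock_hz / word_bits        # word operations per second per processor
board_ops = ops_per_processor * processors      # word operations per second per board

print(ops_per_processor)  # 3125000.0
print(board_ops)          # 3200000000.0
```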



[Figure: Systola 1024 board architecture; labelled parts include the western IP and the ISA]

Fig. 4. Data paths in Systola 1024



4 Mapping of Sequence Comparison onto the ISA 

Systolic parallelisation of the Smith-Waterman algorithm on a linear processor array
is well-known (see Section 2). In order to extend this algorithm to a mesh
architecture, we take advantage of the ISA's capabilities to perform row broadcast and
row ringshift in a very efficient way (see Section 3). Since the length of the sequences
may vary (several thousands in some cases, although commonly the length is only in
the hundreds), the computation must also be partitioned on the N×N ISA. For the sake of
clarity we first assume the processor array size to be equal to the query sequence
length M, i.e. $M = N^2$.
Fig. 5. (a) Data flow for aligning two sequences A and B on an M = N×N ISA: A is loaded into
the ISA one character per PE and B is completely shifted through the array in M+K-1 steps.
Each character $b_j$ is input from the lower western IP and results are written into the upper
western IP. (b) For the computation of H(i,j), E(i,j), and F(i,j), the values H(i-1,j), F(i-1,j), and
$b_j$ are received from the neighbouring PE (according to the data flow in (a)), while
H(i-1,j-1), H(i,j-1), E(i,j-1), $a_i$, $\alpha$, $\beta$, and Sbt($a_i$,$b_j$) are stored locally.

Fig. 5a shows the data flow in the ISA for aligning the sequences A = $a_0 a_1 \ldots a_{M-1}$
and B = $b_0 b_1 \ldots b_{K-1}$, where A is the query sequence and B is a subject sequence of the
database. As a preprocessing step, symbol $a_i$, i = 0,...,M-1, is loaded into PE (m,n)
with m = N - (i div N) - 1 and n = N - (i mod N) - 1, and B is loaded into the lower western
IP. After that, the row of the substitution table corresponding to the respective
character is loaded into each PE, as well as the constants $\alpha$ and $\beta$. B is then completely
shifted through the array in M+K-1 steps as displayed in Fig. 5a.

In iteration step k, 1 ≤ k ≤ M+K-1, the values H(i,j), E(i,j), and F(i,j) for all i, j 
with 1 ≤ i ≤ M, 1 ≤ j ≤ K and k = i+j-1 are computed in parallel in the PEs (m,n) with 
m = N - (i div N) - 1 and n = N - (i mod N) - 1. For this calculation PE (m,n) receives the 
values H(i-1,j), F(i-1,j), and b_j from its eastern neighbour (m,n+1) if n < N-1, or from 
PE (m+1,0) if n = N-1 and m < N-1, while the values H(i-1,j-1), H(i,j-1), E(i,j-1), a_i, 
α, β, and Sbt(a_i,b_j) are stored locally (see Fig. 5b). The lower right PE (N-1,N-1) 
receives b_j in steps j with 0 ≤ j ≤ K-1 from the lower western IP and zeros otherwise. 
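
To make the step structure concrete, the sketch below sweeps the dynamic programming matrix one anti-diagonal per step (k = i+j-1), which is the order in which the PEs would produce the cells. It is a sequential Python simulation under stated assumptions, not the Systola 1024 code: the function names, the toy substitution function `simple_sbt`, and the use of full H, E, F matrices (instead of per-PE registers) are all illustrative.

```python
from typing import Callable, Sequence


def sw_score_wavefront(a: Sequence[str], b: Sequence[str],
                       sbt: Callable[[str, str], int],
                       alpha: int, beta: int) -> int:
    """Best local alignment score with affine gaps, one anti-diagonal per step."""
    m_len, k_len = len(a), len(b)
    # Index 0 is a zero border, so cell (i, j) refers to a[i-1] and b[j-1].
    h = [[0] * (k_len + 1) for _ in range(m_len + 1)]
    e = [[0] * (k_len + 1) for _ in range(m_len + 1)]
    f = [[0] * (k_len + 1) for _ in range(m_len + 1)]
    best = 0
    for step in range(1, m_len + k_len):  # steps 1 .. M+K-1
        lo = max(1, step + 1 - k_len)
        hi = min(m_len, step)
        # All cells with i + j - 1 == step depend only on earlier steps,
        # so on the ISA each of them can be handled by a different PE.
        for i in range(lo, hi + 1):
            j = step + 1 - i
            e[i][j] = max(h[i][j - 1] - alpha, e[i][j - 1] - beta)
            f[i][j] = max(h[i - 1][j] - alpha, f[i - 1][j] - beta)
            t = max(0, h[i - 1][j - 1] + sbt(a[i - 1], b[j - 1]))
            h[i][j] = max(t, e[i][j], f[i][j])
            best = max(best, h[i][j])  # running maximum of matrix H
    return best


def simple_sbt(x: str, y: str) -> int:
    """Toy substitution score (assumption): +2 for a match, -1 otherwise."""
    return 2 if x == y else -1


if __name__ == "__main__":
    print(sw_score_wavefront("ACGT", "ACT", simple_sbt, alpha=1, beta=1))
```

In this sketch a gap of length L costs α + (L-1)·β, and only the highest score is returned, mirroring the fact that the board reports a single maximum per pairwise comparison.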

Because of the efficient row ringshift and row broadcast, these routing operations 
can be accomplished in constant time on the ISA. Thus, it takes M+K-1 steps to 
compute the alignment cost of the two sequences with the Smith-Waterman 
algorithm. Notice, however, that after the last character of B has entered the array, 
the first character of a new subject sequence can be input in the next iteration step. 
Thus, all subject sequences of the database can be pipelined with only one step of delay 
between two different sequences. Assuming k sequences of length K and K = O(M), we 
compute k sequence alignments in time O(kM) using O(M) processors. As the best 
sequential algorithm takes O(kM^2) steps, our parallel implementation achieves 
maximal efficiency. 
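
The efficiency argument can be checked numerically with a small step-count model. The Python sketch below is an illustration only: it assumes that consecutive subject sequences are separated by exactly one idle step (`delay=1`), that the array has M processors, and that one step costs the same as one sequential cell update; the function names are hypothetical.

```python
from typing import Sequence


def pipelined_steps(m: int, subject_lengths: Sequence[int], delay: int = 1) -> int:
    """Steps to stream all subjects through an array holding a query of length m."""
    if not subject_lengths:
        return 0
    gaps = (len(subject_lengths) - 1) * delay
    # Input phase (all characters plus gaps) followed by m - 1 steps to drain.
    return sum(subject_lengths) + gaps + m - 1


def efficiency(m: int, subject_lengths: Sequence[int], delay: int = 1) -> float:
    """Sequential cell updates divided by (processors x parallel steps)."""
    sequential = m * sum(subject_lengths)
    return sequential / (m * pipelined_steps(m, subject_lengths, delay))


if __name__ == "__main__":
    print(pipelined_steps(1024, [300]))         # single subject: M + K - 1
    print(efficiency(1024, [300] * 100_000))    # close to 1 for a long pipeline
```

For a single subject the model reproduces the M+K-1 steps stated above; as the number of pipelined subjects grows, the fill and drain phases are amortised and the efficiency approaches a constant close to one.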

Because of the very limited memory of each PE, only the highest score of matrix H 
is computed on Systola 1024 for each pairwise comparison (see Fig. 1). Ranking the 
compared sequences and reconstructing the alignments are carried out by the front 
end PC. Because this last operation is only performed for very few subject sequences, 
its computation time is negligible. In our ISA algorithm the maximum computation of 
the matrix H can be easily incorporated with only a constant time penalty: After each 
iteration step all PEs compute a new value max by taking the maximum of the newly 
computed H-value and the old value of max from its neighbouring PE. After the last 
character of a subject sequence has been processed in PE (0,0), the maximum of 
matrix H is stored in PE (0,0), which is written into the adjacent western IP. 

So far we have assumed a processor array equal in size to the query sequence 
length (M = N^2). In practice, this rarely happens. Assuming a query sequence length of 
M = k·N with k a multiple of N or N a multiple of k, the algorithm is modified as 
follows: 

1. k ≤ N: In this case we can just replicate the algorithm for a k×N ISA on an 
N×N ISA, i.e. each k×N subarray computes the alignment of the same query sequence 
with different subject sequences. 

2. k > N: A possible solution is to assign k/N characters of the sequences to 
each PE instead of one. However, in this case the memory size has to be sufficient to 
store k/N rows of the substitution table (20 values per row, since there are 20 different 
amino acids), i.e. on Systola 1024 it is only possible to assign at most two 
characters per PE. Thus, for k/N > 2 the sequence comparison has to be split 
into k/(2N) passes: 

The first 2N^2 characters of the query sequence are loaded into the ISA. The entire 
database then crosses the array; the H-value and F-value computed in PE (0,0) in each 
iteration step are written into the adjacent western IP and then stored in the western 
board RAM. In the next pass the following 2N^2 characters of the query sequence are 
loaded. The data stored previously is loaded into the lower western IP together with 
the corresponding subject sequences and from there sent again into the ISA. The 
process is iterated until the end of the query sequence is reached. Note that no 
additional instructions are necessary for the I/O of the intermediate results with the 
processor array, because it is integrated in the dataflow (see Fig. 5a). The additionally 
required data transfer between IPs and board RAM can be performed concurrently 
with the computation (see Section 5 for more details). 
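
The multi-pass scheme for long queries amounts to cutting the query into chunks of at most 2N^2 characters. A minimal Python sketch of that bookkeeping follows; the function name `plan_passes` and its default arguments (N = 32 and two characters per PE, as on Systola 1024) are illustrative assumptions.

```python
from typing import List, Tuple


def plan_passes(query_len: int, n: int = 32, chars_per_pe: int = 2) -> List[Tuple[int, int]]:
    """Split a query into half-open index ranges, one range per database pass."""
    if query_len < 0:
        raise ValueError("query length must be non-negative")
    chunk = n * n * chars_per_pe  # 2 * N^2 characters fit on the array at once
    return [(start, min(start + chunk, query_len))
            for start in range(0, query_len, chunk)]


if __name__ == "__main__":
    # A query of length 4096 needs two passes on a 32 x 32 array.
    print(plan_passes(4096))
```

Each range corresponds to one complete crossing of the database; the H- and F-values leaving PE (0,0) in one pass are the boundary inputs of the next.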

To achieve even higher performance we mapped the database scanning application 
onto a cluster of 16 Systola 1024 boards (see Fig. 6). The cluster consists of 16 PCs 
(Pentium II 450) connected via a Gigabit-per-second LAN (using Myrinet M2F-PCI32 
as network interface cards and Myrinet M2L-SW16 as a switch). For parallel 
application development we use the MPI library MPICH v. 1.1.2. 

To distribute the computation among the PCs we have chosen a static split 
load balancing strategy: A similar-sized subset of the database is assigned to each PC 
in a preprocessing step. The subsets remain stationary regardless of the query 
sequence. Thus, the distribution has to be performed only once for each database and 
does not influence the overall computing time. The input query sequence is broadcast 
to each PC and multiple independent subset scans are performed on each Systola 1024 
board. Finally, the highest scores are accumulated in one PC. 
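
One simple way to obtain similar-sized subsets is a greedy assignment of the longest sequences first. The Python sketch below illustrates this; the text does not say how the subsets were actually formed, so the greedy rule and the function name `static_split` are assumptions made for illustration.

```python
import heapq
from typing import List, Sequence


def static_split(lengths: Sequence[int], n_nodes: int = 16) -> List[List[int]]:
    """Assign sequence indices to nodes so that total residues per node are similar."""
    # Min-heap of (current load in residues, node id).
    heap = [(0, node) for node in range(n_nodes)]
    heapq.heapify(heap)
    subsets: List[List[int]] = [[] for _ in range(n_nodes)]
    order = sorted(range(len(lengths)), key=lambda idx: lengths[idx], reverse=True)
    for idx in order:
        load, node = heapq.heappop(heap)  # least loaded node so far
        subsets[node].append(idx)
        heapq.heappush(heap, (load + lengths[idx], node))
    return subsets


if __name__ == "__main__":
    lengths = [5, 3, 3, 2, 2, 1]
    parts = static_split(lengths, n_nodes=2)
    print([sum(lengths[i] for i in part) for part in parts])
```

Because the split depends only on the database, it can be computed once and reused for every query, as described above.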

This strategy provides the best performance for our homogeneous architecture, 
where each processing unit has the same processing power. However, a dynamic split 
load balancing strategy as used in [13] is more suitable for heterogeneous 
environments. 

Fig. 6. Architecture of a hybrid parallel system: A coarse-grained cluster of 16 PCs with 
Systola 1024 PCI boards 



5 Performance Evaluation 

A performance measure commonly used in computational biology is millions of 
dynamic cell updates per second (MCUPS). A CUPS represents the time for a 
complete computation of one entry of the matrix H, including all comparisons, 
additions and maxima computations. To measure the MCUPS performance on Systola 
1024, we have given the instruction count to update two H-cells per PE in Table 1. 



Table 1. Instruction count to update two H-cells in one PE of Systola 1024 with the 
corresponding operations. 

Operation in each PE per iteration step                     Instruction count 
------------------------------------------------------------------------------ 
Get H(i-1,j), F(i-1,j), b_j, max_{i-1} from neighbour               20 
Lookup Sbt(a_i,b_j) in internal memory                              16 
Compute t = max{0, H(i-1,j-1) + Sbt(a_i,b_j)}                        4 
Compute F(i,j) = max{H(i-1,j)-α, F(i-1,j)-β}                         8 
Compute E(i,j) = max{H(i,j-1)-α, E(i,j-1)-β}                         8 
Compute H(i,j) = max{t, F(i,j), E(i,j)}                              8 
Compute max_i = max{H(i,j), max_{i-1}}                               4 
------------------------------------------------------------------------------ 
Sum                                                                 68 



Because new H-values are computed for two characters within 68 instructions in 
each PE, the whole 32×32 processor array can perform 2048 cell updates in the same 
time. This leads to a performance of 

    (2048/68) × f CUPS = (2048/68) × (50/16) × 10^6 CUPS = 94 MCUPS 
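
The arithmetic behind this figure can be reproduced in a few lines. The Python sketch below is only a check of the numbers in the text; the function name `mcups` is hypothetical, and the reading of 50/16 · 10^6 as a 50 MHz clock divided by a 16-bit word length is an interpretation of the expression above.

```python
def mcups(cells_per_round: int, instructions_per_round: int,
          clock_hz: float, word_bits: int) -> float:
    """Millions of cell updates per second for the whole processor array."""
    word_ops_per_second = clock_hz / word_bits  # per bit-serial PE
    rounds_per_second = word_ops_per_second / instructions_per_round
    return cells_per_round * rounds_per_second / 1e6


if __name__ == "__main__":
    # 32 x 32 PEs, two cells per PE, 68 instructions per round.
    print(round(mcups(2048, 68, 50e6, 16), 1))
```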

Because MCUPS does not consider data transfer time and query length, it is often a 
weak measure that does not reflect the behaviour of the complete system. Therefore, 
we will use execution times of database scans for different query lengths in our 
evaluation. 

The data transfer involved in each iteration step is: input of a new character b_j into 
the lower western IP of each k×N subarray for query lengths ≤ 2048 (case 1 in 
Section 4), and input of a new b_j and a previously computed cell of H and F and output 
of an H-cell and F-cell from the upper western IP for query lengths > 2048 (case 2 in 
Section 4). Thus, the data transfer time is totally dominated by the above computing 
time of 68 instructions per iteration step. 



Table 2. Scan times (in seconds) of TrEMBL 14 for various query sequence lengths on 
Systola 1024, a PC cluster with 16 Systola 1024, and a Pentium III 600. The speed up 
compared to the Pentium III is also reported. 

Query sequence length            256           512     1024     2048     4096 
------------------------------------------------------------------------------ 
Systola 1024                     [illegible]   577     1137     2241     4611 
  speed up                       [illegible]     6        6        6        6 
PC cluster of 16 Systolas         20            38       73      142      290 
  speed up                        81            86       91       94       92 
Pentium III 600 MHz             1615          3286     6611    13343    26690 


Table 2 reports times for scanning the TrEMBL protein databank (release 14, 
which contains 351'834 sequences and 100'069'442 amino acids) for query 
sequences of various lengths with the Smith-Waterman algorithm. The first two rows 
of the table give the execution times for Systola 1024 and the cluster with 16 boards 
compared to a sequential C program on a Pentium III 600. As the times show, the 
parallel implementations scale almost linearly with the sequence length. Because of 
the static split strategy, the cluster times also scale almost linearly with the 
number of PCs. A single Systola 1024 board is 5-6 times faster than a Pentium III 
600. However, a board redesign based on the technology used for processors such as the 
Pentium III (Systola 1024 was built in 1994 [12]) would make this factor 
significantly higher. 

Fig. 7 shows time measurements of sequence comparison with the Smith-Waterman 
algorithm on different parallel machines. The data for the other machines 
is taken from [4]. Systola 1024 is around two times faster than the much larger 1K-PE 
MasPar and the cluster of 16 Systolas is around two times faster than a 16K-PE 
MasPar. The 1-board Kestrel is 4-5 times faster than a Systola board. Kestrel's design 
[4] is also a programmable fine-grained parallel architecture implemented as a PC 
add-on board. It reaches the higher performance because it has been built with 
0.5-µm CMOS technology, in comparison to 1.0-µm for Systola 1024. Extrapolating to 
this technology, both approaches should perform equally. However, the difference 
between both architectures is that Kestrel is purely a linear array, while Systola is a 
mesh. This makes the Systola 1024 a more flexible design, suitable for a wider range 
of applications, see e.g. [5, 17-21]. 

[Figure: bar chart of search times for query lengths 512, 1024, and 2048 on 1K-PE MasPar, 
Systola 1024, Kestrel, 16K-PE MasPar, and Systola Cluster] 

Fig. 7. Time comparison for a 10 Mbase search with the Smith-Waterman algorithm on 
different parallel machines for different query lengths. The values for 1K-PE MasPar, Kestrel, 
and 16K-PE MasPar are taken from [4], while the values for Systola are based on the TrEMBL 
14 scanning times (see Table 2) divided by a normalisation factor of 10. 



6 Conclusions and Future Work 

In this paper we have demonstrated that the ISA concept is very suitable for scanning 
biosequence databases. We have presented the design of an ISA algorithm that leads 
to a high-speed implementation on Systola 1024 exploiting the fine-grained 
parallelism inherent to the sequence comparison problem. By additionally using a 
coarse-grained distribution of the database within a cluster of Systola 1024, we can 
achieve supercomputer performance at low cost. 

The exponential growth of genomic databases demands even more powerful 
parallel solutions in the future. Because comparison and alignment algorithms that are 
favoured by biologists are not fixed, programmable parallel solutions are required to 
speed up these tasks. As an alternative to special-purpose systems, hard-to-program 
reconfigurable systems, and expensive supercomputers, we advocate the use of 
specialised yet programmable hardware whose development is tuned to system speed. 

Our future work in parallel computing will include identifying more applications 
that profit from this type of processing power consisting of a combination of fine- 
grained and coarse-grained parallelism, like scientific computing and multimedia 
video processing. The results of this study will influence our design decision to build 
a next-generation Systola board consisting of one large 128x128 ISA or of a cluster of 
16 32x32 ISAs. 

1 Introduction 

When designing high voltage equipment like power transformers, it is essential 
to calculate eddy-current problems in a transformer precisely and efficiently in 
order to determine possible losses. A method suitable for such simulations is the 
Boundary Element Method (BEM) [2]. As far as the simulation is concerned, for 
electrical devices operating continuously under alternating current, time-harmonic 
states are of interest. These lead to an elliptic transmission problem for the 
eddy-current Maxwell equations. With some modifications, the linear equation system 
resulting from the boundary element discretization is well-conditioned. For realistic 
problems, however, the discretization leads to very large, non-symmetric systems of 
linear equations. To deal with such large equation systems, iterative solution 
techniques such as GMRES [9] must be employed. However, for certain combinations of 
materials occurring in electrical engineering (such as iron and copper parts) the 
parallel boundary integral equation system and its discretizations are 
ill-conditioned, primarily because of the physical parameters in the problem 
formulation. In order to cope with such problems, the Seminar for Applied Mathematics 
at the ETH Zurich, Switzerland, has developed a preconditioner for the eddy-current 
system of second kind Boundary Integral Equations [4], which has been integrated into 
the framework of the boundary element field simulation code POLOPT [1]. For this 
code, it is important to deploy a network with both high bandwidth and low latency, 
such as the Scalable Coherent Interface (SCI) [6], which achieves significantly 
better performance than standard Ethernet. With such a cluster installed at ABB 
Corporate Research, Heidelberg, it is now possible to perform realistic eddy-current 
calculations in a shorter amount of time. 

2 Physical Background and Simulation Process 

Eddy currents are often generated in transformers and cause power losses and 
heat problems. Faraday's law implies that a changing flux produces an induced 
electric field even in empty space. Inserting a metal plate into this empty space 
produces electric currents (eddy currents) in the metal. If the induced currents 
are created by a changing magnetic field, the eddy currents will be perpendicular 
to the magnetic field. By constructing a transformer core of alternating layers of 
conducting and nonconducting materials, the size of the induced loops is reduced, 
which in turn reduces the energy loss. 

Mathematically, the conventional approach to the calculation of eddy currents 
is based on the formulation of the boundary value problem with respect to 
a vector magnetic potential. This is justified for two-dimensional problems, where 
the vector magnetic potential has only one component. In three-dimensional 
problem space, however, the potential is a three-dimensional vector like the field 
itself. Thus, Maxwell's equations should be used directly, with respect to field 
vectors. As can be seen in [4], this eventually yields a system of second kind 
boundary integral equations on a conductor surface by using the so-called H-φ 
formulation [7]. When the Boundary Element Method (BEM) is applied to these 
equations, a linear equation system is obtained. However, this system is still 
ill-conditioned and hence cannot be solved with the Generalized Minimal Residual 
method (GMRES) [9], which is the solver used in POLOPT. Therefore, a 
preconditioner is applied in [4] so that after preconditioning the equation system 
can be solved with GMRES: First, a coefficient matrix is generated by POLOPT, 
which can be done in parallel since the generation of a line is independent of 
any other line. Then the parallel preconditioner and solver described in [4] are 
applied. Typical sizes for the equation systems are in the range of 3-4-5 orders 
of magnitude with densely populated coefficient matrices. 



3 High Performance Cluster Computing 

In the last few years, clusters built from commodity-off-the-shelf (COTS) 
components have become increasingly popular and have found their way from pure 
research use to industrial production environments. Existing specialized cluster 
interconnects, such as Myrinet [3] and SCI [6], provide considerably higher 
bandwidth and much lower latencies. All of them are based on the principle of 
user-level communication, enabling applications to benefit directly from the 
improved communication performance. This leads to a significantly improved overall 
performance, comparable to that of expensive tightly coupled multiprocessors, but at 
a fraction of their cost. 

In this work, PC clusters interconnected using the Scalable Coherent 
Interface (SCI) are used. SCI is an IEEE standardized [6] state-of-the-art SAN 
technology allowing for link speeds of up to 667 MB/s, a process-to-process 
communication bandwidth of up to 85 MB/s, and latencies of less than 2 µs. 

In order to allow applications to exploit SCI's capabilities, several projects 
aim at providing both standard communication libraries like MPI [8] and low-level 
communication mechanisms like Sockets in a way that fully exploits the underlying 
hardware capabilities and avoids excessive protocol stacks like TCP/IP. 
Typically over 90% of the raw performance can be achieved using these 
approaches. 

In 1999, ABB Corporate Research Center in Heidelberg, Germany, installed 
a cluster of 8 LINUX-based 500 MHz Pentium-III class PCs connected via both 
Fast Ethernet and SCI. Each node is equipped with 1 GB of physical memory. For 
the parallel version of POLOPT, the SCI-based MPI implementation ScaMPI 
[5] from SCALI AS has been used. It is fully compliant with the MPI 1.1 
specification and highly optimized for SCI-based architectures. Its raw performance 
on the target architecture used for this work is around 5 µs in MPI end-to-end 
latency and 80 MB/s in bandwidth, which is about 94% of the raw performance of SCI 
on the ABB setup. This shows that ScaMPI enables applications to directly leverage 
the high performance of the underlying interconnection fabric without the high 
protocol overhead visible in traditional systems. 
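
To see what these two figures mean for a single message, a simple linear cost model can be used. The Python sketch below is an illustration, not a measurement: the model t(n) = latency + n/bandwidth is a common simplification assumed here, 1 MB is taken as 10^6 bytes, and the function names are hypothetical.

```python
def transfer_time(n_bytes: float, latency_s: float = 5e-6,
                  bandwidth_bytes_per_s: float = 80e6) -> float:
    """Estimated time in seconds to send n_bytes under a linear cost model."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


def half_performance_size(latency_s: float = 5e-6,
                          bandwidth_bytes_per_s: float = 80e6) -> float:
    """Message size in bytes at which latency and transmission time are equal."""
    return latency_s * bandwidth_bytes_per_s


if __name__ == "__main__":
    print(half_performance_size())       # message size where latency stops dominating
    print(transfer_time(1_000_000))      # one 10^6-byte message
    print(round(80 / 85 * 100))          # ScaMPI bandwidth as a share of raw SCI, in %
```

Under this model, messages much smaller than a few hundred bytes are dominated by the 5 µs latency, while larger ones are limited by the 80 MB/s bandwidth.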

4 Practical Examples and Results 

The transformer that has served as a benchmarking example is depicted in Figure 1. 
The task is to calculate the distribution of power losses caused by eddy currents 
in the yoke clamping plates in order to detect possible temperature hot spots. To 
guide the magnetic flux and to manipulate the loss distribution, the yoke is extended 
by a so-called flux plate. The yoke clamps are modeled as solid parts. The full 
model, consisting of the magnetic core, the yoke clamps, the flux plate, and the HV 
windings, has been analyzed with all materials assumed to be linear. The obtained 
peak values were 10.071 kA for the Low Voltage winding (30 turns) and 482.175 kA for 
the High Voltage winding (1 turn), respectively. From these values, the eddy 
currents in two different parts of the yoke clamp have been calculated. The required 
number of unknowns and the resulting working set sizes can be seen in Table 1. 
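
The order of magnitude of the working set sizes can be estimated from the number of unknowns. The Python sketch below assumes that the dense coefficient matrix is stored as complex double-precision values (16 bytes per entry); this storage format is an assumption, not something stated in the text, and the function names are hypothetical.

```python
def dense_matrix_bytes(unknowns: int, bytes_per_entry: int = 16) -> int:
    """Memory for a dense unknowns x unknowns matrix (default: complex128 entries)."""
    return unknowns * unknowns * bytes_per_entry


def to_gib(n_bytes: int) -> float:
    """Convert bytes to GiB (2^30 bytes)."""
    return n_bytes / 2**30


if __name__ == "__main__":
    for unknowns in (11547, 15291):
        print(unknowns, round(to_gib(dense_matrix_bytes(unknowns)), 2))
```

Under this assumption the two problem sizes come out at roughly 2.0 and 3.5 GiB, which is close to the 2 GB and 3.6 GB listed in Table 1 and well within the 8 GB of aggregate memory of the 8-node cluster.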

The calculations have been performed on the ABB LINUX cluster. Due to 
the high permeability values of the materials, it is necessary to actually solve 
the equation system twice [4]. The computation times have been measured for 
both Fast Ethernet and SCI. As can be seen from Table 1, a speedup of 
1.3-1.4 is obtained by using the SCI network technology, which significantly 
reduces the runtime. 
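
The speedup values are simply the ratio of the Fast Ethernet time to the SCI time for the same solver run. A short Python check using the solver times listed in Table 1 is given below; the data structure and function name are illustrative.

```python
from typing import Dict, Tuple

# (part, solver) -> (Fast Ethernet seconds, SCI seconds), values from Table 1.
SOLVER_TIMES: Dict[Tuple[int, int], Tuple[int, int]] = {
    (1, 1): (1289, 932),
    (1, 2): (1533, 1080),
    (2, 1): (2636, 2025),
    (2, 2): (2843, 2179),
}


def sci_speedup(fast_ethernet_s: float, sci_s: float) -> float:
    """Speedup of the SCI run relative to the Fast Ethernet run."""
    return fast_ethernet_s / sci_s


if __name__ == "__main__":
    for key, (eth, sci) in SOLVER_TIMES.items():
        print(key, round(sci_speedup(eth, sci), 1))
```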




Fig. 1. Transformer model (panel b: geometry to be analyzed) 

Table 1. Computation times and speedup for transformer example 

Part  Unknowns  Size    Network        Solver 1   Speedup  Solver 2   Speedup 
------------------------------------------------------------------------------ 
1     11547     2 GB    Fast Ethernet  1289 sec   1.0      1533 sec   1.0 
1     11547     2 GB    SCI             932 sec   1.4      1080 sec   1.4 
2     15291     3.6 GB  Fast Ethernet  2636 sec   1.0      2843 sec   1.0 
2     15291     3.6 GB  SCI            2025 sec   1.3      2179 sec   1.3 



5 Conclusions and Outlook 

In this paper we have described the process of eddy-current simulations based 
on new algorithms developed by ETH Zurich. Using a high bandwidth and low 
latency network like SCI, a significant speedup in the solver computation times 
has been achieved. Recently the LINUX PC cluster has been upgraded to 16 
nodes yielding a total main memory of 16GB. Therefore, it will be possible to 
compute even larger eddy current problems as the coefficient matrices will fit 
into the cluster’s main memory. 